hugpv committed
Commit da572bf · 1 Parent(s): 5c5f561

initial commit

Files changed (38)
  1. .gitignore +162 -0
  2. README.md +29 -13
  3. analysis_funcs.py +338 -0
  4. app.py +0 -0
  5. chars_df_columns.md +24 -0
  6. classic_correction_algos.py +552 -0
  7. emreading_funcs.py +994 -0
  8. eyekit_measures.py +194 -0
  9. fixations_df_columns.md +88 -0
  10. item_df_columns.md +4 -0
  11. loss_functions.py +97 -0
  12. models.py +892 -0
  13. models/BERT_20240104-223349_loop_normalize_by_line_height_and_width_True_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00430.ckpt +3 -0
  14. models/BERT_20240104-233803_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00719.ckpt +3 -0
  15. models/BERT_20240107-152040_loop_restrict_sim_data_to_4000_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00515.ckpt +3 -0
  16. models/BERT_20240108-000344_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00706.ckpt +3 -0
  17. models/BERT_20240108-011230_loop_normalize_by_line_height_and_width_True_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00560.ckpt +3 -0
  18. models/BERT_20240109-090419_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00518.ckpt +3 -0
  19. models/BERT_20240122-183729_loop_normalize_by_line_height_and_width_True_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00523.ckpt +3 -0
  20. models/BERT_20240122-194041_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00462.ckpt +3 -0
  21. models/BERT_fin_exp_20240104-223349.yaml +100 -0
  22. models/BERT_fin_exp_20240104-233803.yaml +100 -0
  23. models/BERT_fin_exp_20240107-152040.yaml +100 -0
  24. models/BERT_fin_exp_20240108-000344.yaml +100 -0
  25. models/BERT_fin_exp_20240108-011230.yaml +100 -0
  26. models/BERT_fin_exp_20240109-090419.yaml +100 -0
  27. models/BERT_fin_exp_20240122-183729.yaml +102 -0
  28. models/BERT_fin_exp_20240122-194041.yaml +102 -0
  29. multi_proc_funcs.py +2415 -0
  30. popEye_funcs.py +1373 -0
  31. process_asc_files_in_multi_p.py +149 -0
  32. requirements.txt +25 -0
  33. saccades_df_columns.md +38 -0
  34. sentence_measures.md +35 -0
  35. subject_measures.md +15 -0
  36. trials_df_columns.md +36 -0
  37. utils.py +1349 -0
  38. word_measures.md +58 -0
.gitignore ADDED
@@ -0,0 +1,162 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ share/python-wheels/
24
+ *.egg-info/
25
+ .installed.cfg
26
+ *.egg
27
+ MANIFEST
28
+
29
+ # PyInstaller
30
+ # Usually these files are written by a python script from a template
31
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
32
+ *.manifest
33
+ *.spec
34
+
35
+ # Installer logs
36
+ pip-log.txt
37
+ pip-delete-this-directory.txt
38
+
39
+ # Unit test / coverage reports
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+ .coverage
44
+ .coverage.*
45
+ .cache
46
+ nosetests.xml
47
+ coverage.xml
48
+ *.cover
49
+ *.py,cover
50
+ .hypothesis/
51
+ .pytest_cache/
52
+ cover/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ .pybuilder/
76
+ target/
77
+
78
+ # Jupyter Notebook
79
+ .ipynb_checkpoints
80
+
81
+ # IPython
82
+ profile_default/
83
+ ipython_config.py
84
+
85
+ # pyenv
86
+ # For a library or package, you might want to ignore these files since the code is
87
+ # intended to run in multiple environments; otherwise, check them in:
88
+ # .python-version
89
+
90
+ # pipenv
91
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
93
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
94
+ # install all needed dependencies.
95
+ #Pipfile.lock
96
+
97
+ # poetry
98
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
99
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
100
+ # commonly ignored for libraries.
101
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102
+ #poetry.lock
103
+
104
+ # pdm
105
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106
+ #pdm.lock
107
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108
+ # in version control.
109
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
110
+ .pdm.toml
111
+ .pdm-python
112
+ .pdm-build/
113
+
114
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
115
+ __pypackages__/
116
+
117
+ # Celery stuff
118
+ celerybeat-schedule
119
+ celerybeat.pid
120
+
121
+ # SageMath parsed files
122
+ *.sage.py
123
+
124
+ # Environments
125
+ .env
126
+ .venv
127
+ env/
128
+ venv/
129
+ ENV/
130
+ env.bak/
131
+ venv.bak/
132
+
133
+ # Spyder project settings
134
+ .spyderproject
135
+ .spyproject
136
+
137
+ # Rope project settings
138
+ .ropeproject
139
+
140
+ # mkdocs documentation
141
+ /site
142
+
143
+ # mypy
144
+ .mypy_cache/
145
+ .dmypy.json
146
+ dmypy.json
147
+
148
+ # Pyre type checker
149
+ .pyre/
150
+
151
+ # pytype static type analyzer
152
+ .pytype/
153
+
154
+ # Cython debug symbols
155
+ cython_debug/
156
+
157
+ # PyCharm
158
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
159
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
160
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
161
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
162
+ #.idea/
README.md CHANGED
@@ -1,13 +1,29 @@
1
- ---
2
- title: GazeGenie
3
- emoji: 🔥
4
- colorFrom: purple
5
- colorTo: purple
6
- sdk: streamlit
7
- sdk_version: 1.38.0
8
- app_file: app.py
9
- pinned: false
10
- license: unknown
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
+ # GazeGenie
2
+ A versatile tool for parsing, cleaning, aligning and analysing fixations from eye-tracking reading experiments
3
+
4
+ ## Use via huggingface spaces
5
+ In your browser, navigate to:
6
+
7
+ ## Run via Docker
8
+ mkdir results
9
+ docker run --name gazegenie_app -p 8501:8501 -v $pwd/results:/app/results dockinthehubbing/gaze_genie:latest
10
+
11
+ In your browser, navigate to: http://localhost:8501
12
+
13
+ To restart container later:
14
+ docker start -a gazegenie_app
15
+
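Note: `$pwd` resolves as the working-directory variable in PowerShell; in a bash shell the same volume mount would typically be written with command substitution, e.g. (hypothetical equivalent of the command above):

docker run --name gazegenie_app -p 8501:8501 -v "$(pwd)/results:/app/results" dockinthehubbing/gaze_genie:latest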
16
+ ## Local installation
17
+ #### Install conda to get python
18
+ https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Windows-x86_64.exe
19
+
20
+ #### Package installation in Terminal
21
+ mamba create -n eye python=3.11 -y
22
+ mamba activate eye
23
+ mamba install conda-forge::cairo
24
+ pip install -r requirements.txt
25
+
26
+ #### Run program from Terminal
27
+ conda activate eye
28
+ streamlit run app.py
29
+ In your browser, navigate to: http://localhost:8501
analysis_funcs.py ADDED
@@ -0,0 +1,338 @@
1
+ """
2
+ Partially taken and adapted from: https://github.com/jwcarr/eyekit/blob/1db1913411327b108b87e097a00278b6e50d0751/eyekit/measure.py
3
+ Functions for calculating common reading measures, such as gaze duration or initial landing position.
4
+ """
5
+
6
+ import pandas as pd
7
+ from icecream import ic
8
+
9
+ ic.configureOutput(includeContext=True)
10
+
11
+
12
+ def fix_in_ia(fix_x, fix_y, ia_x_min, ia_x_max, ia_y_min, ia_y_max):
13
+ in_x = ia_x_min <= fix_x <= ia_x_max
14
+ in_y = ia_y_min <= fix_y <= ia_y_max
15
+ if in_x and in_y:
16
+ return True
17
+ else:
18
+ return False
19
+
20
+
21
+ def fix_in_ia_default(fixation, ia_row, prefix):
22
+ return fix_in_ia(
23
+ fixation.x,
24
+ fixation.y,
25
+ ia_row[f"{prefix}_xmin"],
26
+ ia_row[f"{prefix}_xmax"],
27
+ ia_row[f"{prefix}_ymin"],
28
+ ia_row[f"{prefix}_ymax"],
29
+ )
30
+
31
+
32
+ def number_of_fixations_own(trial, dffix, prefix, correction_algo):
33
+ """
34
+ Return the number of fixations on that interest area.
35
+ """
36
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
37
+ counts = []
38
+ for cidx, ia_row in ia_df.iterrows():
39
+ count = 0
40
+ for idx, fixation in dffix.iterrows():
41
+ if fix_in_ia(
42
+ fixation.x,
43
+ fixation.y,
44
+ ia_row[f"{prefix}_xmin"],
45
+ ia_row[f"{prefix}_xmax"],
46
+ ia_row[f"{prefix}_ymin"],
47
+ ia_row[f"{prefix}_ymax"],
48
+ ):
49
+ count += 1
50
+ counts.append(
51
+ {
52
+ f"{prefix}_number": cidx,
53
+ prefix: ia_row[f"{prefix}"],
54
+ f"number_of_fixations_{correction_algo}": count,
55
+ }
56
+ )
57
+ return pd.DataFrame(counts)
58
+
59
+
60
+ def initial_fixation_duration_own(trial, dffix, prefix, correction_algo):
61
+ """
62
+ The duration of the initial fixation on that interest area for each word.
63
+ """
64
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
65
+ durations = []
66
+
67
+ for cidx, ia_row in ia_df.iterrows():
68
+ initial_duration = 0
69
+ for idx, fixation in dffix.iterrows():
70
+ if fix_in_ia_default(fixation, ia_row, prefix):
71
+ initial_duration = fixation.duration
72
+ break # Exit the loop after finding the initial fixation for the word
73
+ durations.append(
74
+ {
75
+ f"{prefix}_number": cidx,
76
+ prefix: ia_row[f"{prefix}"],
77
+ f"initial_fixation_duration_{correction_algo}": initial_duration,
78
+ }
79
+ )
80
+
81
+ return pd.DataFrame(durations)
82
+
83
+
84
+ def first_of_many_duration_own(trial, dffix, prefix, correction_algo):
85
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
86
+ durations = []
87
+ for cidx, ia_row in ia_df.iterrows():
88
+ fixation_durations = []
89
+ for idx, fixation in dffix.iterrows():
90
+ if fix_in_ia_default(fixation, ia_row, prefix):
91
+ fixation_durations.append(fixation.duration)
92
+ if len(fixation_durations) > 1:
93
+ durations.append(
94
+ {
95
+ f"{prefix}_number": cidx,
96
+ prefix: ia_row[f"{prefix}"],
97
+ f"first_of_many_duration_{correction_algo}": fixation_durations[0],
98
+ }
99
+ )
100
+ else:
101
+ durations.append(
102
+ {
103
+ f"{prefix}_number": cidx,
104
+ prefix: ia_row[f"{prefix}"],
105
+ f"first_of_many_duration_{correction_algo}": None,
106
+ }
107
+ )
108
+ if len(durations) > 0:
109
+ return pd.DataFrame(durations)
110
+ else:
111
+ return pd.DataFrame()
112
+
113
+
114
+ def total_fixation_duration_own(trial, dffix, prefix, correction_algo):
115
+ """
116
+ Return the sum duration of all fixations on that interest area.
117
+ """
118
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
119
+ durations = []
120
+ for cidx, ia_row in ia_df.iterrows():
121
+ total_duration = 0
122
+ for idx, fixation in dffix.iterrows():
123
+ if fix_in_ia_default(fixation, ia_row, prefix):
124
+ total_duration += fixation.duration
125
+ durations.append(
126
+ {
127
+ f"{prefix}_number": cidx,
128
+ prefix: ia_row[f"{prefix}"],
129
+ f"total_fixation_duration_{correction_algo}": total_duration,
130
+ }
131
+ )
132
+ return pd.DataFrame(durations)
133
+
134
+
135
+ def gaze_duration_own(trial, dffix, prefix, correction_algo):
136
+ """
137
+ Gaze duration is the sum duration of all fixations
138
+ inside an interest area until the area is exited for the first time.
139
+ """
140
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
141
+ durations = []
142
+ for cidx, ia_row in ia_df.iterrows():
143
+ duration = 0
144
+ in_ia = False
145
+ for idx, fixation in dffix.iterrows():
146
+ if fix_in_ia_default(fixation, ia_row, prefix):
147
+ duration += fixation.duration
148
+ in_ia = True
149
+ elif in_ia:
150
+ break
151
+ durations.append(
152
+ {
153
+ f"{prefix}_number": cidx,
154
+ prefix: ia_row[f"{prefix}"],
155
+ f"gaze_duration_{correction_algo}": duration,
156
+ }
157
+ )
158
+ return pd.DataFrame(durations)
159
+
160
+
161
+ def go_past_duration_own(trial, dffix, prefix, correction_algo):
162
+ """
163
+ Given an interest area and fixation sequence, return the go-past time on
164
+ that interest area. Go-past time is the sum duration of all fixations from
165
+ when the interest area is first entered until when it is first exited to
166
+ the right, including any regressions to the left that occur during that
167
+ time period (and vice versa in the case of right-to-left text).
168
+ """
169
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
170
+ results = []
171
+
172
+ for cidx, ia_row in ia_df.iterrows():
173
+ entered = False
174
+ go_past_time = 0
175
+
176
+ for idx, fixation in dffix.iterrows():
177
+ if fix_in_ia_default(fixation, ia_row, prefix):
178
+ if not entered:
179
+ entered = True
180
+ go_past_time += fixation.duration
181
+ elif entered:
182
+ if ia_row[f"{prefix}_xmax"] < fixation.x: # Interest area has been exited to the right
183
+ break
184
+ go_past_time += fixation.duration
185
+
186
+ results.append(
187
+ {f"{prefix}_number": cidx, prefix: ia_row[f"{prefix}"], f"go_past_duration_{correction_algo}": go_past_time}
188
+ )
189
+
190
+ return pd.DataFrame(results)
191
+
192
+
193
+ def second_pass_duration_own(trial, dffix, prefix, correction_algo):
194
+ """
195
+ Given an interest area and fixation sequence, return the second pass
196
+ duration on that interest area for each word.
197
+ """
198
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
199
+ durations = []
200
+
201
+ for cidx, ia_row in ia_df.iterrows():
202
+ current_pass = None
203
+ next_pass = 1
204
+ pass_duration = 0
205
+ for idx, fixation in dffix.iterrows():
206
+ if fix_in_ia_default(fixation, ia_row, prefix):
207
+ if current_pass is None: # first fixation in a new pass
208
+ current_pass = next_pass
209
+ if current_pass == 2:
210
+ pass_duration += fixation.duration
211
+ elif current_pass == 1: # first fixation to exit the first pass
212
+ current_pass = None
213
+ next_pass += 1
214
+ elif current_pass == 2: # first fixation to exit the second pass
215
+ break
216
+ durations.append(
217
+ {
218
+ f"{prefix}_number": cidx,
219
+ prefix: ia_row[f"{prefix}"],
220
+ f"second_pass_duration_{correction_algo}": pass_duration,
221
+ }
222
+ )
223
+
224
+ return pd.DataFrame(durations)
225
+
226
+
227
+ def initial_landing_position_own(trial, dffix, prefix, correction_algo):
228
+ """
229
+ Return the initial landing position (expressed in character positions) on that interest area.
230
+ Counting is from 1. Returns `None` if no fixation
231
+ landed on the interest area.
232
+ """
233
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
234
+ if prefix == "word":
235
+ chars_df = pd.DataFrame(trial[f"chars_list"])
236
+ else:
237
+ chars_df = None
238
+ results = []
239
+ for cidx, ia_row in ia_df.iterrows():
240
+ landing_position = None
241
+ for idx, fixation in dffix.iterrows():
242
+ if fix_in_ia_default(fixation, ia_row, prefix):
243
+ if prefix == "char":
244
+ landing_position = 1
245
+ else:
246
+ prefix_temp = "char"
247
+ matched_chars_df = chars_df.loc[
248
+ (chars_df.char_xmin >= ia_row[f"{prefix}_xmin"])
249
+ & (chars_df.char_xmax <= ia_row[f"{prefix}_xmax"])
250
+ & (chars_df.char_ymin >= ia_row[f"{prefix}_ymin"])
251
+ & (chars_df.char_ymax <= ia_row[f"{prefix}_ymax"]),
252
+ :,
253
+ ] # TODO need to find way to count correct letter number
254
+ for char_idx, (rowidx, char_row) in enumerate(matched_chars_df.iterrows()):
255
+ if fix_in_ia_default(fixation, char_row, prefix_temp):
256
+ landing_position = char_idx + 1 # starts at 1
257
+ break
258
+ break
259
+ results.append(
260
+ {
261
+ f"{prefix}_number": cidx,
262
+ prefix: ia_row[f"{prefix}"],
263
+ f"initial_landing_position_{correction_algo}": landing_position,
264
+ }
265
+ )
266
+ return pd.DataFrame(results)
267
+
268
+
269
+ def initial_landing_distance_own(trial, dffix, prefix, correction_algo):
270
+ """
271
+ Given an interest area and fixation sequence, return the initial landing
272
+ distance on that interest area. The initial landing distance is the pixel
273
+ distance between the first fixation to land in an interest area and the
274
+ left edge of that interest area (or, in the case of right-to-left text,
275
+ the right edge). Technically, the distance is measured from the text onset
276
+ without including any padding. Returns `None` if no fixation landed on the
277
+ interest area.
278
+ """
279
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
280
+ distances = []
281
+ for cidx, ia_row in ia_df.iterrows():
282
+ initial_distance = None
283
+ for idx, fixation in dffix.iterrows():
284
+ if fix_in_ia_default(fixation, ia_row, prefix):
285
+ distance = abs(ia_row[f"{prefix}_xmin"] - fixation.x)
286
+ if initial_distance is None:
287
+ initial_distance = distance
288
+ break
289
+ distances.append(
290
+ {
291
+ f"{prefix}_number": cidx,
292
+ prefix: ia_row[f"{prefix}"],
293
+ f"initial_landing_distance_{correction_algo}": initial_distance,
294
+ }
295
+ )
296
+ return pd.DataFrame(distances)
297
+
298
+
299
+ def landing_distances_own(trial, dffix, prefix, correction_algo):
300
+ """
301
+ Given an interest area and fixation sequence, return a dataframe with
302
+ the landing distances of all fixations that landed in each interest area.
303
+ """
304
+ ia_df = pd.DataFrame(trial[f"{prefix}s_list"])
305
+ distances = []
306
+ for cidx, ia_row in ia_df.iterrows():
307
+ landing_distances = []
308
+ for idx, fixation in dffix.iterrows():
309
+ if fix_in_ia_default(fixation, ia_row, prefix):
310
+ landing_distance = abs(ia_row[f"{prefix}_xmin"] - fixation.x)
311
+ landing_distances.append(round(landing_distance, ndigits=2))
312
+ distances.append(
313
+ {
314
+ f"{prefix}_number": cidx,
315
+ prefix: ia_row[f"{prefix}"],
316
+ f"landing_distances_{correction_algo}": landing_distances,
317
+ }
318
+ )
319
+ return pd.DataFrame(distances)
320
+
321
+
322
+ def number_of_regressions_in_own(trial, dffix, prefix, correction_algo):
323
+ word_reg_in_count = (
324
+ dffix.groupby([f"on_{prefix}_number_{correction_algo}", f"on_{prefix}_{correction_algo}"])[
325
+ f"{prefix}_reg_in_{correction_algo}"
326
+ ]
327
+ .sum()
328
+ .reset_index()
329
+ .rename(
330
+ columns={
331
+ f"on_{prefix}_number_{correction_algo}": f"{prefix}_number",
332
+ f"{prefix}_reg_in_{correction_algo}": f"number_of_regressions_in_{correction_algo}",
333
+ f"on_{prefix}_{correction_algo}": prefix,
334
+ }
335
+ )
336
+ )
337
+
338
+ return word_reg_in_count
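
For orientation, here is a minimal, hypothetical usage sketch of the word-level measures defined above. The trial dictionary keys and fixation columns mirror what the functions read; the values and the `"slice"` algorithm label are made up for illustration, and the import assumes the module is available as `analysis_funcs`:

```python
# Hypothetical sketch only: assumes analysis_funcs.py is importable and that a
# trial provides a "words_list" of word bounding boxes while fixations carry
# x, y and duration columns, as the functions above expect.
import pandas as pd
from analysis_funcs import gaze_duration_own, number_of_fixations_own

trial = {
    "words_list": [
        {"word": "The", "word_xmin": 100, "word_xmax": 140, "word_ymin": 90, "word_ymax": 110},
        {"word": "cat", "word_xmin": 150, "word_xmax": 190, "word_ymin": 90, "word_ymax": 110},
    ]
}
dffix = pd.DataFrame({"x": [110, 160, 120], "y": [100, 100, 100], "duration": [180, 220, 150]})

gaze = gaze_duration_own(trial, dffix, prefix="word", correction_algo="slice")
counts = number_of_fixations_own(trial, dffix, prefix="word", correction_algo="slice")
print(gaze.merge(counts, on=["word_number", "word"]))
```

Each helper returns one row per interest area, keyed by `word_number`, so the per-measure frames can be merged column-wise as shown.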
app.py ADDED
The diff for this file is too large to render. See raw diff
 
chars_df_columns.md ADDED
@@ -0,0 +1,24 @@
1
+ #### Column names for Character Dataframe
2
+ - subject: Subject name or ID (derived from filename)
3
+ - trial_id: Trial ID
4
+ - condition: Condition (if applicable)
5
+ - item: Item ID
6
+ - index:
7
+ - letternum: Number of the character
8
+ - char: The character
9
+ - char_xmin: x start position (in pixel)
10
+ - char_ymin: y start position (in pixel)
11
+ - char_xmax: x end position (in pixel)
12
+ - char_ymax: y end position (in pixel)
13
+ - char_y_center: y center position (in pixel)
14
+ - char_x_center: x center position (in pixel)
15
+ - assigned_line: Line of text the character belongs to
16
+ - in_word_number: Number of word the character belongs to
17
+ - in_word: Word the character belongs to
18
+ - num_letters_from_start_of_word: Number of characters since the start of the word
19
+ - in_sentence_number: Number of sentence the character belongs to
20
+ - in_sentence: Sentence the character belongs to
21
+ - letline: Character position from start of line
22
+ - wordline: Word position from start of line for the word the character belongs to
23
+ - wordsent: Word position from start of the sentence for the word the character belongs to
24
+ - letword: Character position from the start of the word (counting from the space before the word)
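
To make the schema concrete, here is a small, fabricated example with a subset of the columns listed above (hypothetical values only):

```python
# Two characters of the word "The"; values are invented to illustrate the schema.
import pandas as pd

chars_df = pd.DataFrame(
    [
        {"subject": "s01", "trial_id": 1, "letternum": 0, "char": "T",
         "char_xmin": 100, "char_xmax": 110, "char_ymin": 90, "char_ymax": 110,
         "assigned_line": 0, "in_word_number": 0, "in_word": "The"},
        {"subject": "s01", "trial_id": 1, "letternum": 1, "char": "h",
         "char_xmin": 110, "char_xmax": 120, "char_ymin": 90, "char_ymax": 110,
         "assigned_line": 0, "in_word_number": 0, "in_word": "The"},
    ]
)
print(chars_df[["char", "in_word", "assigned_line"]])
```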
classic_correction_algos.py ADDED
@@ -0,0 +1,552 @@
1
+ """
2
+ Mostly adapted from https://github.com/jwcarr/eyekit/blob/350d055eecaa1581b03db5a847424825ffbb10f6/eyekit/_snap.py
3
+ """
4
+
5
+ import os
6
+ import numpy as np
7
+ from sklearn.cluster import KMeans
8
+ from icecream import ic
9
+
10
+ ic.configureOutput(includeContext=True)
11
+
12
+ os.environ["OMP_NUM_THREADS"] = "1" # Prevents KMeans memory leak on windows
13
+
14
+
15
+ def apply_classic_algo(
16
+ dffix,
17
+ trial,
18
+ algo="slice",
19
+ algo_params=dict(x_thresh=192, y_thresh=32, w_thresh=32, n_thresh=90),
20
+ ):
21
+ fixation_array = dffix.loc[:, ["x", "y"]].values
22
+ y_diff = trial["y_diff"]
23
+ if "y_char_unique" in trial:
24
+ midlines = trial["y_char_unique"]
25
+ else:
26
+ midlines = trial["y_midline"]
27
+ if len(midlines) == 1:
28
+ corrected_fix_y_vals = np.ones((fixation_array.shape[0])) * midlines[0]
29
+ elif fixation_array.shape[0] <= 2:
30
+ corrected_fix_y_vals = np.ones((fixation_array.shape[0])) * midlines[0]
31
+
32
+ else:
33
+ if algo == "slice":
34
+ corrected_fix_y_vals = slice(fixation_array, midlines, line_height=y_diff, **algo_params)
35
+ elif algo == "warp":
36
+ word_center_list = [(word["word_x_center"], word["word_y_center"]) for word in trial["words_list"]]
37
+ corrected_fix_y_vals = warp(fixation_array, word_center_list)
38
+ elif algo == "chain":
39
+ corrected_fix_y_vals = chain(fixation_array, midlines, **algo_params)
40
+ elif algo == "cluster":
41
+ corrected_fix_y_vals = cluster(fixation_array, midlines)
42
+ elif algo == "merge":
43
+ corrected_fix_y_vals = merge(fixation_array, midlines, **algo_params)
44
+ elif algo == "regress":
45
+ corrected_fix_y_vals = regress(fixation_array, midlines, **algo_params)
46
+ elif algo == "segment":
47
+ corrected_fix_y_vals = segment(fixation_array, midlines, **algo_params)
48
+ elif algo == "split":
49
+ corrected_fix_y_vals = split(fixation_array, midlines, **algo_params)
50
+ elif algo == "stretch":
51
+ corrected_fix_y_vals = stretch(fixation_array, midlines, **algo_params)
52
+ elif algo == "attach":
53
+ corrected_fix_y_vals = attach(fixation_array, midlines)
54
+ elif algo == "compare":
55
+ word_center_list = [(word["word_x_center"], word["word_y_center"]) for word in trial["words_list"]]
56
+ n_nearest_lines = min(algo_params["n_nearest_lines"], len(midlines) - 1)
57
+ algo_params["n_nearest_lines"] = n_nearest_lines
58
+ corrected_fix_y_vals = compare(fixation_array, np.array(word_center_list), **algo_params)
59
+ else:
60
+ raise NotImplementedError(f"{algo} not implemented")
61
+ corrected_fix_y_vals = np.round(corrected_fix_y_vals, decimals=2)
62
+ corrected_line_nums = [trial["y_char_unique"].index(y) for y in corrected_fix_y_vals]
63
+ dffix[f"y_{algo}"] = corrected_fix_y_vals
64
+ dffix[f"line_num_{algo}"] = corrected_line_nums
65
+ dffix = dffix.copy()
66
+ return dffix
67
+
68
+
69
+ def slice(fixation_XY, midlines, line_height: float, x_thresh=192, y_thresh=32, w_thresh=32, n_thresh=90):
70
+ """
71
+ Form a set of runs and then reduce the set to *m* by repeatedly merging
72
+ those that appear to be on the same line. Merged sequences are then
73
+ assigned to text lines in positional order. Default params:
74
+ `x_thresh=192`, `y_thresh=32`, `w_thresh=32`, `n_thresh=90`. Requires
75
+ NumPy. Original method by [Glandorf & Schroeder (2021)](https://doi.org/10.1016/j.procs.2021.09.069).
76
+ """
77
+ fixation_XY = np.array(fixation_XY, dtype=float)
78
+ line_Y = np.array(midlines, dtype=float)
79
+ proto_lines, phantom_proto_lines = {}, {}
80
+ # 1. Segment runs
81
+ dist_X = abs(np.diff(fixation_XY[:, 0]))
82
+ dist_Y = abs(np.diff(fixation_XY[:, 1]))
83
+ end_run_indices = list(np.where(np.logical_or(dist_X > x_thresh, dist_Y > y_thresh))[0] + 1)
84
+ run_starts = [0] + end_run_indices
85
+ run_ends = end_run_indices + [len(fixation_XY)]
86
+ runs = [list(range(start, end)) for start, end in zip(run_starts, run_ends)]
87
+ # 2. Determine starting run
88
+ longest_run_i = np.argmax([fixation_XY[run[-1], 0] - fixation_XY[run[0], 0] for run in runs])
89
+ proto_lines[0] = runs.pop(longest_run_i)
90
+ # 3. Group runs into proto lines
91
+ while runs:
92
+ merger_on_this_iteration = False
93
+ for proto_line_i, direction in [(min(proto_lines), -1), (max(proto_lines), 1)]:
94
+ # Create new proto line above or below (depending on direction)
95
+ proto_lines[proto_line_i + direction] = []
96
+ # Get current proto line XY coordinates (if proto line is empty, get phantom coordinates)
97
+ if proto_lines[proto_line_i]:
98
+ proto_line_XY = fixation_XY[proto_lines[proto_line_i]]
99
+ else:
100
+ proto_line_XY = phantom_proto_lines[proto_line_i]
101
+ # Compute differences between current proto line and all runs
102
+ run_differences = np.zeros(len(runs))
103
+ for run_i, run in enumerate(runs):
104
+ y_diffs = [y - proto_line_XY[np.argmin(abs(proto_line_XY[:, 0] - x)), 1] for x, y in fixation_XY[run]]
105
+ run_differences[run_i] = np.mean(y_diffs)
106
+ # Find runs that can be merged into this proto line
107
+ merge_into_current = list(np.where(abs(run_differences) < w_thresh)[0])
108
+ # Find runs that can be merged into the adjacent proto line
109
+ merge_into_adjacent = list(
110
+ np.where(
111
+ np.logical_and(
112
+ run_differences * direction >= w_thresh,
113
+ run_differences * direction < n_thresh,
114
+ )
115
+ )[0]
116
+ )
117
+ # Perform mergers
118
+ for index in merge_into_current:
119
+ proto_lines[proto_line_i].extend(runs[index])
120
+ for index in merge_into_adjacent:
121
+ proto_lines[proto_line_i + direction].extend(runs[index])
122
+ # If no mergers into the adjacent were made, create a phantom line for the adjacent
123
+ if not merge_into_adjacent:
124
+ average_x, average_y = np.mean(proto_line_XY, axis=0)
125
+ adjacent_y = average_y + line_height * direction
126
+ phantom_proto_lines[proto_line_i + direction] = np.array([[average_x, adjacent_y]])
127
+ # Remove all runs that were merged on this iteration
128
+ for index in sorted(merge_into_current + merge_into_adjacent, reverse=True):
129
+ del runs[index]
130
+ merger_on_this_iteration = True
131
+ # If no mergers were made, break the while loop
132
+ if not merger_on_this_iteration:
133
+ break
134
+ # 4. Assign any leftover runs to the closest proto lines
135
+ for run in runs:
136
+ best_pl_distance = np.inf
137
+ best_pl_assignment = None
138
+ for proto_line_i in proto_lines:
139
+ if proto_lines[proto_line_i]:
140
+ proto_line_XY = fixation_XY[proto_lines[proto_line_i]]
141
+ else:
142
+ proto_line_XY = phantom_proto_lines[proto_line_i]
143
+ y_diffs = [y - proto_line_XY[np.argmin(abs(proto_line_XY[:, 0] - x)), 1] for x, y in fixation_XY[run]]
144
+ pl_distance = abs(np.mean(y_diffs))
145
+ if pl_distance < best_pl_distance:
146
+ best_pl_distance = pl_distance
147
+ best_pl_assignment = proto_line_i
148
+ proto_lines[best_pl_assignment].extend(run)
149
+ # 5. Prune proto lines
150
+ while len(proto_lines) > len(line_Y):
151
+ top, bot = min(proto_lines), max(proto_lines)
152
+ if len(proto_lines[top]) < len(proto_lines[bot]):
153
+ proto_lines[top + 1].extend(proto_lines[top])
154
+ del proto_lines[top]
155
+ else:
156
+ proto_lines[bot - 1].extend(proto_lines[bot])
157
+ del proto_lines[bot]
158
+ # 6. Map proto lines to text lines
159
+ for line_i, proto_line_i in enumerate(sorted(proto_lines)):
160
+ fixation_XY[proto_lines[proto_line_i], 1] = line_Y[line_i]
161
+ return fixation_XY[:, 1]
162
+
163
+
164
+ def attach(fixation_XY, line_Y):
165
+ n = len(fixation_XY)
166
+ for fixation_i in range(n):
167
+ line_i = np.argmin(abs(line_Y - fixation_XY[fixation_i, 1]))
168
+ fixation_XY[fixation_i, 1] = line_Y[line_i]
169
+ return fixation_XY[:, 1]
170
+
171
+
172
+ def chain(fixation_XY, midlines, x_thresh=192, y_thresh=32):
173
+ """
174
+ Chain consecutive fixations that are sufficiently close to each other, and
175
+ then assign chains to their closest text lines. Default params:
176
+ `x_thresh=192`, `y_thresh=32`. Requires NumPy. Original method
177
+ implemented in [popEye](https://github.com/sascha2schroeder/popEye/).
178
+ """
179
+ try:
180
+ import numpy as np
181
+ except ModuleNotFoundError as e:
182
+ e.msg = "The chain method requires NumPy."
183
+ raise
184
+ fixation_XY = np.array(fixation_XY)
185
+ line_Y = np.array(midlines)
186
+ dist_X = abs(np.diff(fixation_XY[:, 0]))
187
+ dist_Y = abs(np.diff(fixation_XY[:, 1]))
188
+ end_chain_indices = list(np.where(np.logical_or(dist_X > x_thresh, dist_Y > y_thresh))[0] + 1)
189
+ end_chain_indices.append(len(fixation_XY))
190
+ start_of_chain = 0
191
+ for end_of_chain in end_chain_indices:
192
+ mean_y = np.mean(fixation_XY[start_of_chain:end_of_chain, 1])
193
+ line_i = np.argmin(abs(line_Y - mean_y))
194
+ fixation_XY[start_of_chain:end_of_chain, 1] = line_Y[line_i]
195
+ start_of_chain = end_of_chain
196
+ return fixation_XY[:, 1]
197
+
198
+
199
+ def cluster(fixation_XY, line_Y):
200
+ m = len(line_Y)
201
+ fixation_Y = fixation_XY[:, 1].reshape(-1, 1)
202
+ if fixation_Y.shape[0] < m:
203
+ ic(f"CLUSTER failed because of low number of fixations: {fixation_XY.shape}")
204
+ ic("Assigned all fixation to first line")
205
+ return np.ones_like(fixation_XY[:, 1]) * line_Y[0]
206
+ clusters = KMeans(m, n_init=100, max_iter=300).fit_predict(fixation_Y)
207
+ centers = [fixation_Y[clusters == i].mean() for i in range(m)]
208
+ ordered_cluster_indices = np.argsort(centers)
209
+ for fixation_i, cluster_i in enumerate(clusters):
210
+ line_i = np.where(ordered_cluster_indices == cluster_i)[0][0]
211
+ fixation_XY[fixation_i, 1] = line_Y[line_i]
212
+ return fixation_XY[:, 1]
213
+
214
+
215
+ def compare(fixation_XY, word_XY, x_thresh=512, n_nearest_lines=3):
216
+ # COMPARE
217
+ #
218
+ # Lima Sanches, C., Kise, K., & Augereau, O. (2015). Eye gaze and text
219
+ # line matching for reading analysis. In Adjunct proceedings of the
220
+ # 2015 ACM International Joint Conference on Pervasive and
221
+ # Ubiquitous Computing and proceedings of the 2015 ACM International
222
+ # Symposium on Wearable Computers (pp. 1227–1233). Association for
223
+ # Computing Machinery.
224
+ #
225
+ # https://doi.org/10.1145/2800835.2807936
226
+ line_Y = np.unique(word_XY[:, 1])
227
+ n = len(fixation_XY)
228
+ diff_X = np.diff(fixation_XY[:, 0])
229
+ end_line_indices = list(np.where(diff_X < -x_thresh)[0] + 1)
230
+ end_line_indices.append(n)
231
+ start_of_line = 0
232
+ for end_of_line in end_line_indices:
233
+ gaze_line = fixation_XY[start_of_line:end_of_line]
234
+ mean_y = np.mean(gaze_line[:, 1])
235
+ lines_ordered_by_proximity = np.argsort(abs(line_Y - mean_y))
236
+ nearest_line_I = lines_ordered_by_proximity[:n_nearest_lines]
237
+ line_costs = np.zeros(n_nearest_lines)
238
+ for candidate_i in range(n_nearest_lines):
239
+ candidate_line_i = nearest_line_I[candidate_i]
240
+ text_line = word_XY[word_XY[:, 1] == line_Y[candidate_line_i]]
241
+ dtw_cost, dtw_path = dynamic_time_warping(gaze_line[:, 0:1], text_line[:, 0:1])
242
+ line_costs[candidate_i] = dtw_cost
243
+ line_i = nearest_line_I[np.argmin(line_costs)]
244
+ fixation_XY[start_of_line:end_of_line, 1] = line_Y[line_i]
245
+ start_of_line = end_of_line
246
+ return fixation_XY[:, 1]
247
+
248
+
249
+ def merge(fixation_XY, midlines, text_right_to_left=False, y_thresh=32, gradient_thresh=0.1, error_thresh=20):
250
+ """
251
+ Form a set of progressive sequences and then reduce the set to *m* by
252
+ repeatedly merging those that appear to be on the same line. Merged
253
+ sequences are then assigned to text lines in positional order. Default
254
+ params: `y_thresh=32`, `gradient_thresh=0.1`, `error_thresh=20`. Requires
255
+ NumPy. Original method by [Špakov et al. (2019)](https://doi.org/10.3758/s13428-018-1120-x).
256
+ """
257
+ try:
258
+ import numpy as np
259
+ except ModuleNotFoundError as e:
260
+ e.msg = "The merge method requires NumPy."
261
+ raise
262
+ fixation_XY = np.array(fixation_XY)
263
+ line_Y = np.array(midlines)
264
+ diff_X = np.diff(fixation_XY[:, 0])
265
+ dist_Y = abs(np.diff(fixation_XY[:, 1]))
266
+ if text_right_to_left:
267
+ sequence_boundaries = list(np.where(np.logical_or(diff_X > 0, dist_Y > y_thresh))[0] + 1)
268
+ else:
269
+ sequence_boundaries = list(np.where(np.logical_or(diff_X < 0, dist_Y > y_thresh))[0] + 1)
270
+ sequence_starts = [0] + sequence_boundaries
271
+ sequence_ends = sequence_boundaries + [len(fixation_XY)]
272
+ sequences = [list(range(start, end)) for start, end in zip(sequence_starts, sequence_ends)]
273
+ for min_i, min_j, remove_constraints in [
274
+ (3, 3, False), # Phase 1
275
+ (1, 3, False), # Phase 2
276
+ (1, 1, False), # Phase 3
277
+ (1, 1, True), # Phase 4
278
+ ]:
279
+ while len(sequences) > len(line_Y):
280
+ best_merger = None
281
+ best_error = np.inf
282
+ for i in range(len(sequences) - 1):
283
+ if len(sequences[i]) < min_i:
284
+ continue # first sequence too short, skip to next i
285
+ for j in range(i + 1, len(sequences)):
286
+ if len(sequences[j]) < min_j:
287
+ continue # second sequence too short, skip to next j
288
+ candidate_XY = fixation_XY[sequences[i] + sequences[j]]
289
+ gradient, intercept = np.polyfit(candidate_XY[:, 0], candidate_XY[:, 1], 1)
290
+ residuals = candidate_XY[:, 1] - (gradient * candidate_XY[:, 0] + intercept)
291
+ error = np.sqrt(sum(residuals**2) / len(candidate_XY))
292
+ if remove_constraints or (abs(gradient) < gradient_thresh and error < error_thresh):
293
+ if error < best_error:
294
+ best_merger = (i, j)
295
+ best_error = error
296
+ if best_merger is None:
297
+ break # no possible mergers, break while and move to next phase
298
+ merge_i, merge_j = best_merger
299
+ merged_sequence = sequences[merge_i] + sequences[merge_j]
300
+ sequences.append(merged_sequence)
301
+ del sequences[merge_j], sequences[merge_i]
302
+ mean_Y = [fixation_XY[sequence, 1].mean() for sequence in sequences]
303
+ ordered_sequence_indices = np.argsort(mean_Y)
304
+ for line_i, sequence_i in enumerate(ordered_sequence_indices):
305
+ fixation_XY[sequences[sequence_i], 1] = line_Y[line_i]
306
+ return fixation_XY[:, 1]
307
+
308
+
309
+ def regress(
310
+ fixation_XY,
311
+ midlines,
312
+ slope_bounds=(-0.1, 0.1),
313
+ offset_bounds=(-50, 50),
314
+ std_bounds=(1, 20),
315
+ ):
316
+ """
317
+ Find *m* regression lines that best fit the fixations and group fixations
318
+ according to best fit regression lines, and then assign groups to text
319
+ lines in positional order. Default params: `slope_bounds=(-0.1, 0.1)`,
320
+ `offset_bounds=(-50, 50)`, `std_bounds=(1, 20)`. Requires SciPy.
321
+ Original method by [Cohen (2013)](https://doi.org/10.3758/s13428-012-0280-3).
322
+ """
323
+ try:
324
+ import numpy as np
325
+ from scipy.optimize import minimize
326
+ from scipy.stats import norm
327
+ except ModuleNotFoundError as e:
328
+ e.msg = "The regress method requires SciPy."
329
+ raise
330
+ fixation_XY = np.array(fixation_XY)
331
+ line_Y = np.array(midlines)
332
+ density = np.zeros((len(fixation_XY), len(line_Y)))
333
+
334
+ def fit_lines(params):
335
+ k = slope_bounds[0] + (slope_bounds[1] - slope_bounds[0]) * norm.cdf(params[0])
336
+ o = offset_bounds[0] + (offset_bounds[1] - offset_bounds[0]) * norm.cdf(params[1])
337
+ s = std_bounds[0] + (std_bounds[1] - std_bounds[0]) * norm.cdf(params[2])
338
+ predicted_Y_from_slope = fixation_XY[:, 0] * k
339
+ line_Y_plus_offset = line_Y + o
340
+ for line_i in range(len(line_Y)):
341
+ fit_Y = predicted_Y_from_slope + line_Y_plus_offset[line_i]
342
+ density[:, line_i] = norm.logpdf(fixation_XY[:, 1], fit_Y, s)
343
+ return -sum(density.max(axis=1))
344
+
345
+ best_fit = minimize(fit_lines, [0, 0, 0], method="powell")
346
+ fit_lines(best_fit.x)
347
+ return line_Y[density.argmax(axis=1)]
348
+
349
+
350
+ def segment(fixation_XY, midlines, text_right_to_left=False):
351
+ """
352
+ Segment fixation sequence into *m* subsequences based on *m*–1 most-likely
353
+ return sweeps, and then assign subsequences to text lines in chronological
354
+ order. Requires NumPy. Original method by
355
+ [Abdulin & Komogortsev (2015)](https://doi.org/10.1109/BTAS.2015.7358786).
356
+ """
357
+ try:
358
+ import numpy as np
359
+ except ModuleNotFoundError as e:
360
+ e.msg = "The segment method requires NumPy."
361
+ raise
362
+ fixation_XY = np.array(fixation_XY)
363
+ line_Y = np.array(midlines)
364
+ diff_X = np.diff(fixation_XY[:, 0])
365
+ saccades_ordered_by_length = np.argsort(diff_X)
366
+ if text_right_to_left:
367
+ line_change_indices = saccades_ordered_by_length[-(len(line_Y) - 1) :]
368
+ else:
369
+ line_change_indices = saccades_ordered_by_length[: len(line_Y) - 1]
370
+ current_line_i = 0
371
+ for fixation_i in range(len(fixation_XY)):
372
+ fixation_XY[fixation_i, 1] = line_Y[current_line_i]
373
+ if fixation_i in line_change_indices:
374
+ current_line_i += 1
375
+ return fixation_XY[:, 1]
376
+
377
+
378
+ def split(fixation_XY, midlines, text_right_to_left=False):
379
+ """
380
+ Split fixation sequence into subsequences based on best candidate return
381
+ sweeps, and then assign subsequences to closest text lines. Requires
382
+ SciPy. Original method by [Carr et al. (2022)](https://doi.org/10.3758/s13428-021-01554-0).
383
+ """
384
+ try:
385
+ import numpy as np
386
+ from scipy.cluster.vq import kmeans2
387
+ except ModuleNotFoundError as e:
388
+ e.msg = "The split method requires SciPy."
389
+ raise
390
+ fixation_XY = np.array(fixation_XY)
391
+ line_Y = np.array(midlines)
392
+ diff_X = np.array(np.diff(fixation_XY[:, 0]), dtype=float).reshape(-1, 1)
393
+ centers, clusters = kmeans2(diff_X, 2, iter=100, minit="++", missing="raise")
394
+ if text_right_to_left:
395
+ sweep_marker = np.argmax(centers)
396
+ else:
397
+ sweep_marker = np.argmin(centers)
398
+ end_line_indices = list(np.where(clusters == sweep_marker)[0] + 1)
399
+ end_line_indices.append(len(fixation_XY))
400
+ start_of_line = 0
401
+ for end_of_line in end_line_indices:
402
+ mean_y = np.mean(fixation_XY[start_of_line:end_of_line, 1])
403
+ line_i = np.argmin(abs(line_Y - mean_y))
404
+ fixation_XY[start_of_line:end_of_line, 1] = line_Y[line_i]
405
+ start_of_line = end_of_line
406
+ return fixation_XY[:, 1]
407
+
408
+
409
+ def stretch(fixation_XY, midlines, stretch_bounds=(0.9, 1.1), offset_bounds=(-50, 50)):
410
+ """
411
+ Find a stretch factor and offset that results in a good alignment between
412
+ the fixations and lines of text, and then assign the transformed fixations
413
+ to the closest text lines. Default params: `stretch_bounds=(0.9, 1.1)`,
414
+ `offset_bounds=(-50, 50)`. Requires SciPy.
415
+ Original method by [Lohmeier (2015)](http://www.monochromata.de/master_thesis/ma1.3.pdf).
416
+ """
417
+ try:
418
+ import numpy as np
419
+ from scipy.optimize import minimize
420
+ except ModuleNotFoundError as e:
421
+ e.msg = "The stretch method requires SciPy."
422
+ raise
423
+ fixation_Y = np.array(fixation_XY)[:, 1]
424
+ line_Y = np.array(midlines)
425
+ n = len(fixation_Y)
426
+ corrected_Y = np.zeros(n)
427
+
428
+ def fit_lines(params):
429
+ candidate_Y = fixation_Y * params[0] + params[1]
430
+ for fixation_i in range(n):
431
+ line_i = np.argmin(abs(line_Y - candidate_Y[fixation_i]))
432
+ corrected_Y[fixation_i] = line_Y[line_i]
433
+ return sum(abs(candidate_Y - corrected_Y))
434
+
435
+ best_fit = minimize(fit_lines, [1, 0], method="powell", bounds=[stretch_bounds, offset_bounds])
436
+ fit_lines(best_fit.x)
437
+ return corrected_Y
438
+
439
+
440
+ def warp(fixation_XY, word_center_list):
441
+ """
442
+ Map fixations to word centers using [Dynamic Time
443
+ Warping](https://en.wikipedia.org/wiki/Dynamic_time_warping). This finds a
444
+ monotonically increasing mapping between fixations and words with the
445
+ shortest overall distance, effectively resulting in *m* subsequences.
446
+ Fixations are then assigned to the lines that their mapped words belong
447
+ to, effectively assigning subsequences to text lines in chronological
448
+ order. Requires NumPy.
449
+ Original method by [Carr et al. (2022)](https://doi.org/10.3758/s13428-021-01554-0).
450
+ """
451
+ try:
452
+ import numpy as np
453
+ except ModuleNotFoundError as e:
454
+ e.msg = "The warp method requires NumPy."
455
+ raise
456
+ fixation_XY = np.array(fixation_XY)
457
+ word_XY = np.array(word_center_list)
458
+ n1 = len(fixation_XY)
459
+ n2 = len(word_XY)
460
+ cost = np.zeros((n1 + 1, n2 + 1))
461
+ cost[0, :] = np.inf
462
+ cost[:, 0] = np.inf
463
+ cost[0, 0] = 0
464
+ for fixation_i in range(n1):
465
+ for word_i in range(n2):
466
+ distance = np.sqrt(sum((fixation_XY[fixation_i] - word_XY[word_i]) ** 2))
467
+ cost[fixation_i + 1, word_i + 1] = distance + min(
468
+ cost[fixation_i, word_i + 1],
469
+ cost[fixation_i + 1, word_i],
470
+ cost[fixation_i, word_i],
471
+ )
472
+ cost = cost[1:, 1:]
473
+ warping_path = [[] for _ in range(n1)]
474
+ while fixation_i > 0 or word_i > 0:
475
+ warping_path[fixation_i].append(word_i)
476
+ possible_moves = [np.inf, np.inf, np.inf]
477
+ if fixation_i > 0 and word_i > 0:
478
+ possible_moves[0] = cost[fixation_i - 1, word_i - 1]
479
+ if fixation_i > 0:
480
+ possible_moves[1] = cost[fixation_i - 1, word_i]
481
+ if word_i > 0:
482
+ possible_moves[2] = cost[fixation_i, word_i - 1]
483
+ best_move = np.argmin(possible_moves)
484
+ if best_move == 0:
485
+ fixation_i -= 1
486
+ word_i -= 1
487
+ elif best_move == 1:
488
+ fixation_i -= 1
489
+ else:
490
+ word_i -= 1
491
+ warping_path[0].append(0)
492
+ for fixation_i, words_mapped_to_fixation_i in enumerate(warping_path):
493
+ candidate_Y = list(word_XY[words_mapped_to_fixation_i, 1])
494
+ fixation_XY[fixation_i, 1] = max(set(candidate_Y), key=candidate_Y.count)
495
+ return fixation_XY[:, 1]
496
+
497
+
498
+ def dynamic_time_warping(sequence1, sequence2):
499
+ n1 = len(sequence1)
500
+ n2 = len(sequence2)
501
+ dtw_cost = np.zeros((n1 + 1, n2 + 1))
502
+ dtw_cost[0, :] = np.inf
503
+ dtw_cost[:, 0] = np.inf
504
+ dtw_cost[0, 0] = 0
505
+ for i in range(n1):
506
+ for j in range(n2):
507
+ this_cost = np.sqrt(sum((sequence1[i] - sequence2[j]) ** 2))
508
+ dtw_cost[i + 1, j + 1] = this_cost + min(dtw_cost[i, j + 1], dtw_cost[i + 1, j], dtw_cost[i, j])
509
+ dtw_cost = dtw_cost[1:, 1:]
510
+ dtw_path = [[] for _ in range(n1)]
511
+ while i > 0 or j > 0:
512
+ dtw_path[i].append(j)
513
+ possible_moves = [np.inf, np.inf, np.inf]
514
+ if i > 0 and j > 0:
515
+ possible_moves[0] = dtw_cost[i - 1, j - 1]
516
+ if i > 0:
517
+ possible_moves[1] = dtw_cost[i - 1, j]
518
+ if j > 0:
519
+ possible_moves[2] = dtw_cost[i, j - 1]
520
+ best_move = np.argmin(possible_moves)
521
+ if best_move == 0:
522
+ i -= 1
523
+ j -= 1
524
+ elif best_move == 1:
525
+ i -= 1
526
+ else:
527
+ j -= 1
528
+ dtw_path[0].append(0)
529
+ return dtw_cost[-1, -1], dtw_path
530
+
531
+
532
+ def wisdom_of_the_crowd(assignments):
533
+ """
534
+ For each fixation, choose the y-value with the most votes across multiple
535
+ algorithms. In the event of a tie, the left-most algorithm is given
536
+ priority.
537
+ """
538
+ assignments = np.column_stack(assignments)
539
+ correction = []
540
+ for row in assignments:
541
+ candidates = list(row)
542
+ candidate_counts = {y: candidates.count(y) for y in set(candidates)}
543
+ best_count = max(candidate_counts.values())
544
+ best_candidates = [y for y, c in candidate_counts.items() if c == best_count]
545
+ if len(best_candidates) == 1:
546
+ correction.append(best_candidates[0])
547
+ else:
548
+ for y in row:
549
+ if y in best_candidates:
550
+ correction.append(y)
551
+ break
552
+ return correction
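
A rough, hypothetical usage sketch for the dispatcher above (toy coordinates; the trial keys `y_diff` and `y_char_unique` mirror what `apply_classic_algo` reads, and the import assumes the module is available as `classic_correction_algos`):

```python
# Hypothetical sketch only: assumes classic_correction_algos.py is importable.
import pandas as pd
from classic_correction_algos import apply_classic_algo

trial = {
    "y_diff": 100,                # vertical spacing between text lines (px)
    "y_char_unique": [100, 200],  # y midline of each text line (px)
}
# Six fixations drifting around two text lines
dffix = pd.DataFrame(
    {"x": [100, 180, 260, 105, 190, 270], "y": [98, 102, 101, 198, 205, 202]}
)

dffix = apply_classic_algo(dffix, trial, algo="slice")
print(dffix[["y", "y_slice", "line_num_slice"]])
```

Outputs from several algorithms (e.g. `y_slice`, `y_chain`, ...) can then be combined per fixation with `wisdom_of_the_crowd`, which takes the majority vote across the assignments.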
emreading_funcs.py ADDED
@@ -0,0 +1,994 @@
1
+ """Mostly adapted from https://github.com/martin-vasilev/EMreading
2
+ Mostly deprecated in favour of alternative methods."""
3
+
4
+ from icecream import ic
5
+ from io import StringIO
6
+ import re
7
+ import numpy as np
8
+ import pandas as pd
9
+
10
+
11
+ def assign_chars_to_words(df):
12
+ df.reset_index(inplace=True, names="index_temp")
13
+ df["wordID"] = ""
14
+ df["char_word"] = -1
15
+ word_list = []
16
+ cols = []
17
+ sent_list = df["sent"].unique()
18
+
19
+ for i in sent_list: # for each sentence (robust if sentence numbers are not 0..n-1)
20
+ word_list = df[df["sent"] == i]["word"].unique()
21
+ for j in range(len(word_list)):
22
+ cols = df[(df["sent"] == i) & (df["word"] == word_list[j])].index
23
+ df.loc[cols, "wordID"] = "".join(df["char"].loc[cols])
24
+ df.loc[(df["sent"] == i) & (df["word"] == word_list[j]), "char_word"] = [k for k in range(len(cols))]
25
+ df.set_index("index_temp", inplace=True)
26
+ return df
27
+
28
+
29
+ def round_and_int(value):
30
+ if not pd.isna(value):
31
+ return int(round(value))
32
+ else:
33
+ return None
34
+
35
+
36
+ def get_coord_map(coords, x=1920, y=1080):
37
+ """
38
+ Original R version:
39
+ ```R
40
+ # Use stimuli information to create a coordinate map_arr for each pixel on the screen
41
+ # This makes it possible to find exactly what participants were fixating
42
+ coord_map_arr<- function(coords, x=resolution_x, y= resolution_y){
43
+
44
+ coords$id<- 1:nrow(coords)
45
+ map_arr<- data.frame(matrix(NA, nrow = y, ncol = x))
46
+
47
+ for(i in 1:nrow(coords)){
48
+ map_arr[coords$y1[i]:coords$y2[i],coords$x1[i]:coords$x2[i]]<- coords$id[i]
49
+
50
+ }
51
+
52
+ return(map_arr)
53
+ }```
54
+ """
55
+ coords.reset_index(drop=True, inplace=True)
56
+ y1 = coords["char_ymin"].map(round_and_int)
57
+ y2 = coords["char_ymax"].map(round_and_int)
58
+ x1 = coords["char_xmin"].map(round_and_int)
59
+ x2 = coords["char_xmax"].map(round_and_int)
60
+ coords["id"] = np.arange(len(coords))
61
+ map_arr = np.full((y, x), np.nan)
62
+
63
+ for i in range(len(coords)):
64
+ map_arr[y1[i] : y2[i] + 1, x1[i] : x2[i] + 1] = coords["id"].iloc[i]
65
+
66
+ np.sum(pd.isna(map_arr), axis=None)
67
+ return map_arr
68
+
69
+
70
+ def get_char_num_for_each_line(df):
71
+ df.reset_index(inplace=True, names="index_temp")
72
+ df["line_char"] = np.nan
73
+ unq_line = df["assigned_line"].unique()
74
+ for i in unq_line:
75
+ assigned_line = df[df["assigned_line"] == i].index
76
+ df.loc[assigned_line, "line_char"] = range(len(assigned_line))
77
+ df.set_index("index_temp", inplace=True)
78
+ return df
79
+
80
+
81
+ def parse_fix(
82
+ file,
83
+ trial_db,
84
+ ):
85
+
86
+ indexrange = list(range(trial_db["trial_start_idx"], trial_db["trial_end_idx"] + 1))
87
+
88
+ sfix_stamps = [i for i in indexrange if re.search(r"(?i)(SFIX)", file[i])]
89
+
90
+ efix_stamps = [i for i in indexrange if re.search(r"(?i)EFIX", file[i])]
91
+
92
+ if len(sfix_stamps) > (len(efix_stamps) + 1):
93
+ ic(f"length mismatch parse_fix of {len(sfix_stamps) - (len(efix_stamps))}")
94
+
95
+ if not sfix_stamps or not efix_stamps:
96
+ raw_fix = None
97
+ return raw_fix
98
+ for safe_num in range(25):
99
+ if efix_stamps[0] < sfix_stamps[0]:
100
+ efix_stamps = efix_stamps[1:]
101
+ elif efix_stamps[-1] <= sfix_stamps[-1]:
102
+ sfix_stamps = sfix_stamps[:-1]
103
+ elif efix_stamps[0] >= sfix_stamps[0]:
104
+ sfix_stamps = sfix_stamps[1:]
105
+ if not (len(efix_stamps) != len(sfix_stamps) and len(efix_stamps) > 1 and len(sfix_stamps) > 1):
106
+ break
107
+
108
+ def parse_sacc(string):
109
+ a = string.split(" ")
110
+ return float(a[2])
111
+
112
+ esacc_flag = [file[f - 1] if "ESACC" in file[f - 1] else None for f in sfix_stamps]
113
+ saccDur = []
114
+ for k in esacc_flag:
115
+ if k is None:
116
+ saccDur.append(None)
117
+ else:
118
+ saccDur.append(parse_sacc(k))
119
+
120
+ s_time = [int(file[s].strip().split(" ")[-1]) for s in sfix_stamps]
121
+
122
+ e_time = [int(file[s - 1].strip().split(" ")[0]) for s in efix_stamps]
123
+ if len(s_time) != len(e_time):
124
+ if s_time[-1] > e_time[-1]:
125
+ s_time = s_time[:-1]
126
+
127
+ fixDur = [e_time[index] - s_time[index] for index in range(len(s_time))]
128
+ fixDur = [e - s for e, s in zip(e_time, s_time)]
129
+ assert ~(np.asarray(fixDur) < 0).any()
130
+ x = [float(file[fidx].split("\t")[3]) for fidx in efix_stamps]
131
+ y = [float(file[fidx].split("\t")[4]) for fidx in efix_stamps]
132
+ blink_stamp = [index for index in indexrange if "EBLINK" in file[index]]
133
+ blink_time = [float(file[index].strip().replace("\t", " ").split(" ")[2]) - 1 for index in blink_stamp]
134
+ index = np.searchsorted(s_time, blink_time, side="right") - 1
135
+ blink = np.zeros((len(s_time)))
136
+ blink[index] = -1
137
+ raw_fix = pd.DataFrame(
138
+ {"s_time": s_time, "e_time": e_time, "fixDur": fixDur, "saccDur": saccDur, "x": x, "y": y, "blink": blink}
139
+ )
140
+ return raw_fix
141
+
142
+
143
+ def process_fix_EM(fix, coords_map, coords, SL):
144
+ resolution_y, resolution_x = coords_map.shape
145
+ loc = None
146
+ raw_fix = pd.DataFrame()
147
+ num_fixations = len(fix)
148
+ SFIX = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
149
+ EFIX = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
150
+ x = np.full(num_fixations, np.nan)
151
+ y = np.full(num_fixations, np.nan)
152
+ fix_num = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
153
+ fix_dur = np.full(num_fixations, None)
154
+ sent = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
155
+ line = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
156
+ word = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
157
+ char_trial = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
158
+ char_line = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
159
+ word_line = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
160
+ max_sent = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
161
+ max_word = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
162
+ regress = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
163
+ blink = pd.array([None] * num_fixations, dtype=pd.BooleanDtype())
164
+ outOfBnds = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
165
+ outsideText = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
166
+ wordID = np.full(num_fixations, None)
167
+ land_pos = pd.array([None] * num_fixations, dtype=pd.Int64Dtype())
168
+ sacc_len = np.full(num_fixations, np.nan)
169
+
170
+ max_sentence = coords["in_sentence_number"].max()
171
+
172
+ curr_sent = np.zeros((max_sentence + 1, 2))
173
+ curr_sent[: max_sentence + 1, 0] = np.arange(0, max_sentence + 1)
174
+
175
+ if isinstance(coords["index"], str):
176
+ coords["index"] = pd.to_numeric(coords["index"], errors="coerce")
177
+
178
+ for j in range(len(fix)):
179
+ if (fix["y"][j] > 0) and (fix["x"][j] > 0) and (fix["y"][j] <= resolution_y) and (fix["x"][j] <= resolution_x):
180
+ loc = coords_map[round(fix["y"][j]), round(fix["x"][j])]
181
+ if pd.isnull(loc):
182
+ loc = None
183
+ else:
184
+ loc = None
185
+
186
+ fix_num[j] = j
187
+ fix_dur[j] = fix["duration"][j]
188
+ SFIX[j] = fix["start_uncorrected"][j]
189
+ EFIX[j] = fix["stop_uncorrected"][j]
190
+ x[j] = fix["x"][j]
191
+ y[j] = fix["y"][j]
192
+ blink[j] = fix["blink"][j]
193
+
194
+ if x[j] < 1 or x[j] > resolution_x or y[j] < 1 or y[j] > resolution_y:
195
+ outOfBnds[j] = 1
196
+ else:
197
+ outOfBnds[j] = 0
198
+ outsideText[j] = 1 if loc is None else 0
199
+
200
+ if fix["x"][j] < 0:
201
+ loc = None
202
+ outOfBnds[j] = 1
203
+ outsideText[j] = 1
204
+
205
+ if loc is not None:
206
+ sent[j] = coords["in_sentence_number"][loc]
207
+ line[j] = coords["assigned_line"][loc]
208
+ word[j] = coords["in_word_number"][loc]
209
+ word_line[j] = coords["wordline"][loc]
210
+ char_trial[j] = coords["index"][loc] + 1
211
+ char_line[j] = coords["letline"][loc]
212
+ wordID[j] = coords["in_word"][loc]
213
+ land_pos[j] = coords["letword"][loc]
214
+
215
+ if j > 0 and not pd.isna(char_trial[j]) and not pd.isna(char_trial[j - 1]):
216
+ sacc_len[j] = abs(char_trial[j] - char_trial[j - 1])
217
+ else:
218
+ sacc_len[j] = np.nan
219
+ else:
220
+ sent[j] = np.nan
221
+ line[j] = np.nan
222
+ word[j] = np.nan
223
+ word_line[j] = np.nan
224
+ char_trial[j] = np.nan
225
+ char_line[j] = np.nan
226
+ wordID[j] = np.nan
227
+ land_pos[j] = np.nan
228
+ sacc_len[j] = np.nan
229
+
230
+ if SL:
231
+ if loc is not None:
232
+ if j == 0:
233
+ max_sent[j] = sent[j]
234
+ else:
235
+ max_sent[j] = max_sent[j - 1] if pd.isna(sent[j]) or pd.isna(max_sent[j - 1]) else max_sent[j - 1]
236
+ if not (pd.isna(max_sent[j]) or pd.isna(sent[j])) and sent[j] > max_sent[j]:
237
+ max_sent[j] = sent[j]
238
+
239
+ if j == 0:
240
+ max_word[j] = abs(word[j])
241
+ curr_sent[sent[j] - 1, 1] = abs(word[j])
242
+ else:
243
+ max_word[j] = (
244
+ curr_sent[sent[j] - 1, 1]
245
+ if pd.isna(word[j]) or pd.isna(curr_sent[sent[j] - 1, 1])
246
+ else curr_sent[sent[j] - 1, 1]
247
+ )
248
+ if not (pd.isna(word[j]) or pd.isna(max_word[j])) and abs(word[j]) > curr_sent[sent[j] - 1, 1]:
249
+ max_word[j] = abs(word[j])
250
+ curr_sent[sent[j] - 1, 1] = abs(word[j])
251
+
252
+ if not (pd.isna(word[j]) or pd.isna(max_word[j])) and abs(word[j]) < max_word[j]:
253
+ regress[j] = 1
254
+ else:
255
+ regress[j] = 0
256
+
257
+ if j > 0 and not pd.isna(word[j]):
258
+ if pd.isna(regress[j - 1]):
259
+ regress[j] = np.nan
260
+ else:
261
+ if abs(word[j]) == max_word[j] and regress[j - 1] == 1 and word[j] in np.unique(word[:j]):
262
+ regress[j] = 1
263
+
264
+ raw_fix = pd.DataFrame(
265
+ {
266
+ "start_uncorrected": SFIX,
267
+ "stop_uncorrected": EFIX,
268
+ "x": x,
269
+ "y": y,
270
+ "fixation_number": fix_num,
271
+ "on_sentence_number_EM": sent,
272
+ "line_EM": line,
273
+ "word_EM": word,
274
+ "word_line_EM": word_line,
275
+ "char_trial_EM": char_trial,
276
+ "char_line_EM": char_line,
277
+ "regress_EM": regress,
278
+ "wordID_EM": wordID,
279
+ "land_pos_EM": land_pos,
280
+ "sacc_len_EM": sacc_len,
281
+ "blink_EM": blink,
282
+ "outOfBnds_EM": outOfBnds,
283
+ "outsideText_EM": outsideText,
284
+ }
285
+ )
286
+
287
+ fix2 = fix.merge(
288
+ raw_fix,
289
+ on=[
290
+ "start_uncorrected",
291
+ "stop_uncorrected",
292
+ "x",
293
+ "y",
294
+ "fixation_number",
295
+ ],
296
+ how="left",
297
+ )
298
+ return fix2
299
+
300
+
301
+ def RS(i, rawfix, coords, reqYthresh, reqXthresh, Ythresh, Xthresh, threshSimilar):
302
+
303
+ if i == 0:
304
+ return 0
305
+
306
+ lw = coords["char_xmax"][0] - coords["char_xmin"][0]
307
+ lh = coords["char_ymax"][0] - coords["char_ymin"][0]
308
+ meetXthresh = False
309
+ meetYthresh = False
310
+
311
+ leftSacc = rawfix["x"][i] < rawfix["x"][i - 1]
312
+ downSacc = rawfix["y"][i] > rawfix["y"][i - 1]
313
+
314
+ if downSacc & reqYthresh:
315
+ Ydiff = lh * Ythresh
316
+ trueYdiff = rawfix["y"][i] - rawfix["y"][i - 1]
317
+ meetYthresh = trueYdiff >= Ydiff
318
+
319
+ if leftSacc & reqXthresh:
320
+ Xdiff = lw * Xthresh
321
+ trueXdiff = rawfix["x"][i - 1] - rawfix["x"][i]
322
+ meetXthresh = trueXdiff >= Xdiff
323
+
324
+ maxPoints = 1 + 2
325
+ if reqYthresh:
326
+ maxPoints += 1
327
+ if reqXthresh:
328
+ maxPoints += 1
329
+
330
+ currPoints = 0
331
+ if leftSacc:
332
+ currPoints = currPoints + (1 / maxPoints)
333
+ if meetXthresh:
334
+ currPoints = currPoints + (1 / maxPoints)
335
+
336
+ if downSacc:
337
+ currPoints = currPoints + 2 * (1 / maxPoints)
338
+ if meetYthresh:
339
+ currPoints = currPoints + (1 / maxPoints)
340
+
341
+ return round(currPoints, 2)
342
+
343
+
344
+ def reMap(rawfix, i, coords_map, coords, newY=None):
345
+ rawfix.set_index("fixation_number", inplace=True)
346
+ assert i in rawfix.index, "Not in index"
347
+ rawfix.loc[i, "reAligned"] = True
348
+ rawfix.loc[i, "previous_line"] = rawfix.loc[i, "line_EM"]
349
+ rawfix.loc[i, "previous_y"] = rawfix.loc[i, "y"]
350
+ if newY is not None:
351
+ rawfix.loc[i, "y"] = newY
352
+ loc = coords_map[round(rawfix.loc[i, "y"]), round(rawfix.loc[i, "x"])]
353
+ if pd.isnull(loc):
354
+ return rawfix
355
+ rawfix.loc[i, "on_sentence_number_EM"] = coords["in_sentence_number"][loc]
356
+ rawfix.loc[i, "word_EM"] = coords["in_word_number"][loc]
357
+ rawfix.loc[i, "line_EM"] = coords["assigned_line"][loc]
358
+
359
+ return rawfix.reset_index(drop=False, names=["fixation_number"])
360
+
361
+
362
+ def reAlign(rawfix, coords, coords_map, RSpar):
363
+
364
+ ystart = coords["char_ymin"].min()
365
+ yend = coords["char_ymax"].max()
366
+ nlines = coords["assigned_line"].max()
367
+ letterHeight = coords["char_ymax"][0] - coords["char_ymin"][0]
368
+ xstart = pd.DataFrame(columns=["1", "2"])
369
+ xstart["1"] = np.arange(nlines + 1)
370
+ ystart = pd.DataFrame(columns=["1", "2"])
371
+ ystart["1"] = np.arange(nlines + 1)
372
+ xend = pd.DataFrame(columns=["1", "2"])
373
+ xend["1"] = np.arange(nlines + 1)
374
+ yend = pd.DataFrame(columns=["1", "2"])
375
+ yend["1"] = np.arange(nlines + 1)
376
+ rawfix["previous_x"] = np.nan
377
+
378
+ for i in coords["assigned_line"].unique():
379
+ a = coords[coords["assigned_line"] == i]
380
+ xstart.loc[i, "2"] = a["char_xmin"].min()
381
+ xend.loc[i, "2"] = a["char_xmax"].max()
382
+ ystart.loc[i, "2"] = a["char_ymin"].min()
383
+ yend.loc[i, "2"] = a["char_ymax"].min()
384
+
385
+ lineCenter = ystart["2"] + letterHeight / 2
386
+
387
+ rawfix["prob_return_sweep"] = np.nan
388
+ rawfix["prob_interline_saccade"] = np.nan
389
+ rawfix["reAligned"] = False
390
+ rawfix["previous_y"] = np.nan
391
+ rawfix["previous_line"] = np.nan
392
+
393
+ for i in range(rawfix.shape[0]):
394
+ rawfix.loc[i, "prob_return_sweep"] = RS(
395
+ i,
396
+ rawfix,
397
+ coords,
398
+ reqYthresh=True,
399
+ reqXthresh=True,
400
+ Ythresh=RSpar[0],
401
+ Xthresh=RSpar[1],
402
+ threshSimilar=RSpar[2],
403
+ )
404
+
405
+ if i > 0:
406
+ if (rawfix["prob_return_sweep"][i] < 1) & (rawfix["y"][i] > rawfix["y"][i - 1] + letterHeight / 2):
407
+ rawfix.loc[i, "prob_return_sweep"] = 1
408
+
409
+ rawfix.loc[i, "previous_x"] = rawfix["x"][i]
410
+ rawfix.loc[i, "previous_y"] = rawfix["y"][i]
411
+
412
+ if i > 0:
413
+ if rawfix["y"][i] < rawfix["y"][i - 1] - letterHeight / 2:
414
+ rawfix.loc[i, "prob_interline_saccade"] = 1
415
+ else:
416
+ rawfix.loc[i, "prob_interline_saccade"] = 0
417
+
418
+ RsweepFix = np.sort(
419
+ np.concatenate(
420
+ (np.where(rawfix["prob_return_sweep"] == 1)[0], np.where(rawfix["prob_interline_saccade"] == 1)[0])
421
+ )
422
+ )
423
+
424
+ for i in range(len(RsweepFix)):
425
+ if i == 0:
426
+ linePass = rawfix.loc[: RsweepFix[0] - 1]
427
+
428
+ elif i >= len(RsweepFix):
429
+ linePass = rawfix.loc[RsweepFix[-1] :]
430
+
431
+ else:
432
+ linePass = rawfix.loc[RsweepFix[i - 1] : RsweepFix[i] - 1]
433
+
434
+ if linePass.shape[0] == 1:
435
+ continue
436
+
437
+ avgYpos = linePass["y"].mean(skipna=True)
438
+ whichLine = min(range(len(lineCenter)), key=lambda index: abs(lineCenter[index] - avgYpos))
439
+ linePass.reset_index(inplace=True, drop=True)
440
+ for j in range(linePass.shape[0]):
441
+ onLine = (linePass["y"][j] >= ystart["2"][whichLine]) & (linePass["y"][j] <= yend["2"][whichLine])
442
+
443
+ if not onLine:
444
+ if linePass["y"][j] < ystart["2"][whichLine]:
445
+ rawfix = reMap(
446
+ rawfix, linePass.loc[j, "fixation_number"], coords_map, coords, newY=ystart["2"][whichLine] + 5
447
+ )
448
+ else:
449
+ rawfix = reMap(
450
+ rawfix, linePass.loc[j, "fixation_number"], coords_map, coords, newY=yend["2"][whichLine] - 5
451
+ )
452
+ rawfix.loc[linePass.loc[j, "fixation_number"], "reAligned"] = True
453
+ else:
454
+ rawfix.loc[linePass.loc[j, "fixation_number"], "reAligned"] = False
455
+
456
+ return rawfix
457
+
458
+
459
+ def cleanData(
460
+ raw_fix,
461
+ algo_choice,
462
+ removeBlinks=True,
463
+ combineNearbySmallFix=True,
464
+ combineMethod="char",
465
+ combineDist=1,
466
+ removeSmallFix=True,
467
+ smallFixCutoff=80,
468
+ remove_duration_outliers=True,
469
+ outlierMethod="ms",
470
+ outlierCutoff=800,
471
+ keepRS=False,
472
+ ):
473
+
474
+ if combineNearbySmallFix:
475
+ nbefore = raw_fix.shape[0]
476
+ which_comb = []
477
+
478
+ for i in range(nbefore):  # iterate over fixation rows, not DataFrame columns
479
+ prev_line_same = False
480
+ next_line_same = False
481
+
482
+ if (i > 0) and (i < nbefore - 1):
483
+ if combineMethod == "char":
484
+ if (
485
+ pd.isna(raw_fix[f"letternum_{algo_choice}"][i])
486
+ or pd.isna(raw_fix[f"letternum_{algo_choice}"][i - 1])
487
+ or pd.isna(raw_fix[f"letternum_{algo_choice}"][i + 1])
488
+ ):
489
+ continue
490
+
491
+ if raw_fix["duration"][i] < smallFixCutoff:
492
+ if (
493
+ not pd.isna(raw_fix[f"line_num_{algo_choice}"][i])
494
+ and not pd.isna(raw_fix[f"line_num_{algo_choice}"][i - 1])
495
+ and not pd.isna(raw_fix[f"line_num_{algo_choice}"][i + 1])
496
+ ):
497
+
498
+ if raw_fix[f"line_num_{algo_choice}"][i] == raw_fix[f"line_num_{algo_choice}"][i - 1]:
499
+ prev_line_same = True
500
+ if raw_fix[f"line_num_{algo_choice}"][i] == raw_fix[f"line_num_{algo_choice}"][i + 1]:
501
+ next_line_same = True
502
+
503
+ if combineMethod == "char":
504
+ prev = abs(raw_fix[f"letternum_{algo_choice}"][i] - raw_fix[f"letternum_{algo_choice}"][i - 1])
505
+ after = abs(raw_fix[f"letternum_{algo_choice}"][i] - raw_fix[f"letternum_{algo_choice}"][i + 1])
506
+
507
+ else:
508
+ prev = abs(round(raw_fix["x"][i]) - round(raw_fix["x"][i - 1]))
509
+ after = abs(round(raw_fix["x"][i]) - round(raw_fix["x"][i + 1]))
510
+
511
+ if prev <= combineDist:
512
+ which_comb.append(i)
513
+
514
+ if prev_line_same:
515
+
516
+ raw_fix["duration"][i - 1] += raw_fix["duration"][i]
517
+
518
+ if keepRS and (raw_fix["Rtn_sweep"][i] == 1):
519
+
520
+ raw_fix["Rtn_sweep"][i - 1] = 1
521
+
522
+ if after <= combineDist:
523
+ which_comb.append(i)
524
+
525
+ if next_line_same:
526
+
527
+ raw_fix["duration"][i + 1] += raw_fix["duration"][i]
528
+
529
+ if keepRS and (raw_fix["Rtn_sweep"][i] == 1):
530
+
531
+ raw_fix["Rtn_sweep"][i + 1] = 1
532
+
533
+ which_comb = list(set(which_comb))
534
+
535
+ if len(which_comb) > 0:
536
+ raw_fix = raw_fix.drop(labels=which_comb, axis=0)
537
+ nstart = raw_fix.shape[0]
538
+
539
+ if removeBlinks:
540
+ raw_fix = raw_fix[~raw_fix["blink"]].copy()
541
+ nblink = nstart - raw_fix.shape[0]
542
+
543
+ if remove_duration_outliers:
544
+ if outlierMethod == "ms":
545
+ outIndices = np.where(raw_fix["duration"] > outlierCutoff)[0]
546
+ if len(outIndices) > 0:
547
+ raw_fix = raw_fix.drop(outIndices).copy()
548
+ elif outlierMethod == "std":
549
+ nSubCutoff, nOutliers = [], 0
550
+ subM = np.mean(raw_fix["duration"])
551
+ subSTD = np.std(raw_fix["duration"])
552
+ cutoff = subM + outlierCutoff * subSTD
553
+ nSubCutoff.append(len(np.where(raw_fix["duration"] > cutoff)[0]))
554
+ nOutliers = sum(nSubCutoff)
555
+
556
+ return raw_fix.reset_index(drop=True)
557
+
558
+
559
+ def get_space(s):
560
+ if len(s) == 0 or s == " ":
561
+ return 1
562
+ else:
563
+ return None
564
+
565
+
566
+ def get_num(string):
567
+ strr = "".join([i for i in string if i.isdigit()])
568
+ if len(strr) > 0:
569
+ return int(strr)
570
+ else:
571
+ ic(string)
572
+ return strr
573
+
574
+
575
+ def parse_itemID(trialid):
576
+ I = re.search(r"I", trialid).start()
577
+ condition = get_num(trialid[:I])
578
+
579
+ D = re.search(r"D", trialid).start()
580
+ item = get_num(trialid[I + 1 : D])
581
+ depend = get_num(trialid[D:])
582
+
583
+ E = trialid[0]
584
+
585
+ return {"trialid": trialid, "condition": condition, "item": item, "depend": depend, "trial_is": E}
586
+
587
+
588
+ def get_coord(str_input):
589
+ string = "\n".join(
590
+ [l.split("\t")[1].strip() for l in str_input if (("DELAY" not in l) & ("BUTTON" not in l) & ("REGION" in l))]
591
+ )
592
+
593
+ df = pd.read_table(
594
+ StringIO(string),
595
+ sep=" ",
596
+ names=["X" + str(i) for i in range(1, 12)],
597
+ )
598
+ df.loc[:, ["char_xmin", "char_ymin", "char_xmax", "char_ymax", "X11"]] = df[
599
+ ["char_xmin", "char_ymin", "char_xmax", "char_ymax", "X11"]
600
+ ].apply(pd.to_numeric, errors="coerce")
601
+ df.char = df.char.fillna("")
602
+
603
+ a = df[df["char"] == ""].index
604
+ for i in a:
605
+ if "space" not in df.columns:
606
+ df.loc[:, "space"] = None
607
+ df.at[i, "space"] = 1
608
+
609
+ if "char_xmin" in df.columns and "char_ymin" in df.columns:
610
+ df.at[i, "char_xmin"], df.at[i, "char_ymin"] = df.at[i, "char_ymin"], df.at[i, "char_xmax"]
611
+
612
+ if "char_ymin" in df.columns and "char_xmax" in df.columns:
613
+ df.at[i, "char_ymin"], df.at[i, "char_xmax"] = df.at[i, "char_xmax"], df.at[i, "char_ymax"]
614
+
615
+ if "char_xmax" in df.columns and "char_ymax" in df.columns:
616
+ df.at[i, "char_xmax"], df.at[i, "char_ymax"] = df.at[i, "char_ymax"], df.at[i, "X11"]
617
+ df = df.drop(columns=["X1", "X2", "X3", "X5"])
618
+ return df
619
+
620
+
621
+ def map_sent(df):
622
+
623
+ sent_bnd = df[(df.char == ".") | (df.char == "?") | (df.char == "!")].index.tolist()
624
+
625
+ if len(sent_bnd) > 0:
626
+ sent = pd.Series([-1] * len(df))
627
+
628
+ for i, eidx in enumerate(sent_bnd):
629
+ sidx = sent_bnd[i - 1] if i > 0 else 0
630
+ if i == len(sent_bnd) - 1:
631
+ sent.loc[sidx:] = len(sent_bnd) - 1
632
+ else:
633
+ sent.loc[sidx:eidx] = i
634
+ df["sent"] = sent
635
+ else:
636
+ df["sent"] = 1
637
+ return df
638
+
639
+
640
+ def map_line(df):
641
+ df = df[~pd.isnull(df["char_ymin"])].reset_index(names="index_temp")
642
+
643
+ lines = sorted(set(df["char_ymin"].values))
644
+
645
+ assigned_line = np.array([], dtype=int)
646
+
647
+ for i in range(len(lines)):
648
+ loc_lines = np.where(df["char_ymin"].values == lines[i])[0]
649
+ assigned_line = np.concatenate((assigned_line, np.full(len(loc_lines), fill_value=i)))
650
+ df.loc[len(assigned_line) - 1, "space"] = 2
651
+
652
+ df["assigned_line"] = assigned_line
653
+ df.set_index("index_temp", inplace=True)
654
+
655
+ return df
656
+
657
+
658
+ def map_words(df):
659
+ curr_sent, curr_line, curr_word = 0, 0, 0
660
+ df["space"] == 2
661
+
662
+ for i in df.index:
663
+ newSent = curr_sent != df.loc[i, "sent"]
664
+ newLine = curr_line != df.loc[i, "assigned_line"]
665
+
666
+ df.loc[i, "word"] = curr_word
667
+ if df.loc[i, "char"] == "" and not newSent:
668
+ curr_word += 1
669
+ df.loc[i, "word"] = curr_word
670
+
671
+ elif newLine:
672
+ if df.loc[i, "char"] != ".":
673
+ curr_word += 1
674
+ df.loc[i, "word"] = curr_word
675
+ curr_line += 1
676
+
677
+ elif newSent:
678
+ curr_sent += 1
679
+ curr_word = 0
680
+ df.loc[i, "word"] = curr_word
681
+
682
+ return df
683
+
684
+
685
+ def get_return_sweeps(raw_fix_new, coords, algo_choice): # TODO Check if covered by popEye
686
+ currentSent = 0
687
+ currentLine = 0
688
+ maxLine = 0
689
+ inReg = False
690
+
691
+ curr_sent = np.zeros((max(coords["in_sentence_number"]) + 1, 4))
692
+ curr_sent[:, 0] = np.arange(0, max(coords["in_sentence_number"]) + 1)
693
+
694
+ diff_sent = coords["in_sentence_number"].diff().fillna(0)
695
+ last_words = coords.loc[np.where(diff_sent == 1), "in_word_number"]
696
+ curr_sent[:, 2] = np.append(last_words.values, coords["in_word_number"].iloc[-1])
697
+ for m in range(1, len(raw_fix_new)):
698
+ if not (pd.isna(raw_fix_new["char_line_EM"][m - 1]) or pd.isna(raw_fix_new["char_line_EM"][m])):
699
+ raw_fix_new.at[m, "sacc_len_EM"] = abs(raw_fix_new["char_line_EM"][m] - raw_fix_new["char_line_EM"][m - 1])
700
+
701
+ if not pd.isna(raw_fix_new["line_EM"][m]):
702
+ currentLine = raw_fix_new["line_EM"][m]
703
+
704
+ if currentLine > maxLine:
705
+ maxLine = currentLine
706
+ raw_fix_new.at[m, "Rtn_sweep"] = 1
707
+
708
+ if m < len(raw_fix_new) - 1:
709
+ sameLine = (
710
+ not (pd.isna(raw_fix_new["line_EM"][m + 1]) or pd.isna(raw_fix_new["line_EM"][m]))
711
+ and raw_fix_new["line_EM"][m + 1] == raw_fix_new["line_EM"][m]
712
+ )
713
+
714
+ if raw_fix_new["x"][m + 1] < raw_fix_new["x"][m]:
715
+ raw_fix_new.at[m, "Rtn_sweep_type"] = "undersweep" if sameLine else None
716
+ else:
717
+ raw_fix_new.at[m, "Rtn_sweep_type"] = "accurate" if sameLine else None
718
+ else:
719
+ raw_fix_new.at[m, "Rtn_sweep_type"] = np.nan
720
+ else:
721
+ raw_fix_new.at[m, "Rtn_sweep"] = 0
722
+
723
+ if not pd.isna(raw_fix_new["on_sentence_number_EM"][m]):
724
+ if m == 1:
725
+ curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 2] = raw_fix_new["word_EM"][m]
726
+ raw_fix_new.at[m, "regress_EM"] = 0
727
+ else:
728
+ if raw_fix_new["word_EM"][m] > curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 2]:
729
+ curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 2] = raw_fix_new["word_EM"][m]
730
+ inReg = False
731
+
732
+ if currentSent < raw_fix_new["on_sentence_number_EM"][m]:
733
+ curr_sent[currentSent, 3] = 1
734
+ currentSent = raw_fix_new["on_sentence_number_EM"][m]
735
+
736
+ if (
737
+ not pd.isna(raw_fix_new["on_sentence_number_EM"][m - 1])
738
+ and raw_fix_new["on_sentence_number_EM"][m] > raw_fix_new["on_sentence_number_EM"][m - 1]
739
+ ):
740
+ curr_sent[int(raw_fix_new["on_sentence_number_EM"][m - 1]), 3] = 1
741
+
742
+ if (
743
+ raw_fix_new["word_EM"][m] < curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 2]
744
+ and curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 3] == 0
745
+ ):
746
+ raw_fix_new.at[m, "regress_EM"] = 1
747
+ inReg = True
748
+ else:
749
+ if curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 3] == 0:
750
+ raw_fix_new.at[m, "regress_EM"] = 0
751
+
752
+ if (
753
+ raw_fix_new["word_EM"][m] == curr_sent[int(raw_fix_new["on_sentence_number_EM"][m]), 2]
754
+ and inReg
755
+ ):
756
+ raw_fix_new.at[m, "regress_EM"] = 1
757
+ else:
758
+ raw_fix_new.at[m, "regress_EM"] = 1
759
+ raw_fix_new.at[m, "regress2nd_EM"] = 1
760
+ inReg = True
761
+ return raw_fix_new
762
+
763
+
764
+ def word_m_EM(n2):
765
+ sub_list = []
766
+ item_list = []
767
+ cond_list = []
768
+ seq_list = []
769
+ word_list = []
770
+ wordID_list = []
771
+ sent_list = []
772
+ FFD_list = []
773
+ SFD_list = []
774
+ GD_list = []
775
+ TVT_list = []
776
+ nfix1_list = []
777
+ nfix2_list = []
778
+ nfixAll_list = []
779
+ regress_list = []
780
+ o = n2["sent"].unique()
781
+ for k in range(len(o)):
782
+ q = n2[n2["sent"] == o[k]]
783
+ r = sorted(q["word"].unique())
784
+
785
+ for l in range(len(r)):
786
+ word_list.append(r[l])
787
+ sub_list.append(n2["sub"].iloc[0])
788
+ item_list.append(n2["item"].iloc[0])
789
+ seq_list.append(n2["seq"].iloc[0])
790
+ cond_list.append(n2["cond"].iloc[0])
791
+ sent_list.append(o[k])
792
+
793
+ p = q[q["word"] == r[l]]
794
+
795
+ if p.shape[0] == 0:
796
+ FFD_list.append(None)
797
+ SFD_list.append(None)
798
+ GD_list.append(None)
799
+ TVT_list.append(None)
800
+ nfix1_list.append(0)
801
+ nfix2_list.append(0)
802
+ nfixAll_list.append(0)
803
+ else:
804
+ p1 = p[p["regress"] == 0]
805
+ p2 = p[p["regress"] == 1]
806
+
807
+ if p1.shape[0] == 0:
808
+ FFD_list.append(None)
809
+ SFD_list.append(None)
810
+ GD_list.append(None)
811
+ elif p1.shape[0] == 1:
812
+ FFD_list.append(p1["fix_dur"].iloc[0])
813
+ SFD_list.append(p1["fix_dur"].iloc[0])
814
+ GD_list.append(p1["fix_dur"].iloc[0])
815
+ else:
816
+ FFD_list.append(p1["fix_dur"].iloc[0])
817
+ SFD_list.append(None)
818
+ GD_list.append(p1["fix_dur"].sum())
819
+
820
+ TVT_list.append(p["fix_dur"].sum())
821
+ nfix1_list.append(p1.shape[0])
822
+ nfix2_list.append(p2.shape[0])
823
+ nfixAll_list.append(p1.shape[0] + p2.shape[0])
824
+
825
+ wordID_list.append(p["wordID"].iloc[0])
826
+
827
+ if nfix2_list[-1] == 0:
828
+ regress_list.append(0)
829
+ else:
830
+ regress_list.append(1)
831
+
832
+ dataT = pd.DataFrame(
833
+ {
834
+ "sub": sub_list,
835
+ "item": item_list,
836
+ "cond": cond_list,
837
+ "seq": seq_list,
838
+ "word": word_list,
839
+ "wordID": wordID_list,
840
+ "sent": sent_list,
841
+ "FFD": FFD_list,
842
+ "SFD": SFD_list,
843
+ "GD": GD_list,
844
+ "TVT": TVT_list,
845
+ "nfix1": nfix1_list,
846
+ "nfix2": nfix2_list,
847
+ "nfixAll": nfixAll_list,
848
+ "regress": regress_list,
849
+ }
850
+ )
851
+
852
+ sub_list = []
853
+ item_list = []
854
+ cond_list = []
855
+ seq_list = []
856
+ word_list = []
857
+ wordID_list = []
858
+ sent_list = []
859
+ FFD_list = []
860
+ SFD_list = []
861
+ GD_list = []
862
+ TVT_list = []
863
+ nfix1_list = []
864
+ nfix2_list = []
865
+ nfixAll_list = []
866
+ regress_list = []
867
+
868
+ if "dataN" in locals():
869
+ dataN = pd.concat([dataN, dataT], ignore_index=True)
870
+ else:
871
+ dataN = dataT
872
+
873
+
874
+ def word_measures_EM(data, algo_choice, include_time_stamps=False):
875
+ add_blanks = False
876
+
877
+ if "blink" in data.columns:
878
+ required_columns = ["blink", "prev_blink", "after_blink"]
879
+ if all(col in data.columns for col in required_columns):
880
+ if (data["blink"] + data["prev_blink"] + data["after_blink"]).sum() == 0:
881
+ ic("Blinks appear to be already excluded! \n\n")
882
+ else:
883
+ add_blanks = True
884
+ ic("There appears to be valid blink data! We will map blinks to individual words. \n\n")
885
+
886
+ regress_blinks = data[(data["blink"] == 1) & (~data["regress_EM"].isna())].index
887
+
888
+ if len(regress_blinks) < 1:
889
+ BlinkFixTypeNotMapped = True
890
+ ic(
891
+ "Fixation type is not mapped for observations with blinks. Therefore, blinks can't be mapped in terms of 1st and 2nd pass reading."
892
+ )
893
+ ic(
894
+ "Please note that, by default, blink fixation durations will also not be added to fixation duration measures for that word since it's assumed you will delete this word from analysis.\n"
895
+ )
896
+ ic("If you need to change this, see settings in the pre-processing function.\n\n")
897
+
898
+ data_n = pd.DataFrame()
899
+
900
+ o_k = sorted(np.unique(data[f"on_sentence_num_{algo_choice}"]))
901
+
902
+ for k, sent_k in enumerate(o_k):
903
+ q_k = data[data[f"on_sentence_num_{algo_choice}"] == sent_k]
904
+
905
+ p1_k = q_k[q_k["regress_EM"] == 0].copy()
906
+ p2_k = q_k[q_k["regress_EM"] == 1].copy()
907
+
908
+ RS_word = np.nan
909
+ check_next = False
910
+
911
+ if max(data[f"line_num_{algo_choice}"]) > 1:
912
+ for z, q_row in q_k.iterrows():
913
+ if not pd.isna(q_row["Rtn_sweep"]):
914
+ if q_row["Rtn_sweep"] == 1:
915
+ check_next = True
916
+ RS_word = (
917
+ q_row[f"line_word_{algo_choice}"]
918
+ if not pd.isna(q_row[f"line_word_{algo_choice}"])
919
+ else np.nan
920
+ )
921
+ elif check_next and (pd.notna(q_row[f"line_word_{algo_choice}"])) and (q_row["regress_EM"]):
922
+ break
923
+
924
+ word_l = []
925
+ sub_l = [data.loc[0, "subject"]] * len(q_k)
926
+ item_l = [data.loc[0, "item"]] * len(q_k)
927
+ cond_l = [1] * len(q_k)
928
+
929
+ for l, q_row in q_k.iterrows():
930
+ word_l.append(q_row[f"on_word_number_{algo_choice}"])
931
+
932
+ if add_blanks:
933
+ sum_1st_pass = (
934
+ q_row["blink"]
935
+ + p1_k[p1_k.index[q_row.name]]["prev_blink"]
936
+ + p2_k[p2_k.index[q_row.name]]["after_blink"]
937
+ ).sum()
938
+
939
+ blinks_l = [0] * len(word_l)
940
+ if sum_1st_pass > 0:
941
+ blinks_l[l] = 1
942
+
943
+ for l, q_row in q_k.iterrows():
944
+ word_line_l = [q_row[f"line_word_{algo_choice}"]]
945
+
946
+ line_l = [q_row[f"line_num_{algo_choice}"]]
947
+
948
+ if include_time_stamps:
949
+ EFIX_SFD_l = [np.nan]
950
+
951
+ for l, q_row in q_k.iterrows():
952
+ word_line_l.append(q_row[f"line_word_{algo_choice}"])
953
+ line_l.append(q_row[f"line_num_{algo_choice}"])
954
+
955
+ if include_time_stamps:
956
+
957
+ if len(p1_k) > 0:
958
+
959
+ if len(p1_k) == 1:
960
+ EFIX_SFD_l.append(p1_k["stop_uncorrected"][0])
961
+
962
+ data_t = pd.DataFrame(
963
+ list(
964
+ zip(
965
+ sub_l,
966
+ item_l,
967
+ cond_l,
968
+ word_l,
969
+ line_l,
970
+ )
971
+ ),
972
+ columns=[
973
+ "subject",
974
+ "item",
975
+ "condition",
976
+ f"on_word_number_{algo_choice}",
977
+ f"line_num_{algo_choice}",
978
+ "FFD",
979
+ "SFD",
980
+ "GD",
981
+ "TVT",
982
+ "nfix1",
983
+ "nfix2",
984
+ "nfixAll",
985
+ "regress",
986
+ ],
987
+ )
988
+
989
+ if add_blanks:
990
+ data_t["blinks_1stPass"] = blinks_l
991
+
992
+ data_n = pd.concat([data_n, data_t], ignore_index=True)
993
+
994
+ return data_n
eyekit_measures.py ADDED
@@ -0,0 +1,194 @@
1
+ import copy
2
+ import eyekit as ek
3
+ import numpy as np
4
+ import pandas as pd
5
+ from PIL import Image
6
+ from icecream import ic
7
+ import time
8
+
9
+ ic.configureOutput(includeContext=True)
10
+ MEASURES_DICT = {
11
+ "number_of_fixations": [],
12
+ "initial_fixation_duration": [],
13
+ "first_of_many_duration": [],
14
+ "total_fixation_duration": [],
15
+ "gaze_duration": [],
16
+ "go_past_duration": [],
17
+ "second_pass_duration": [],
18
+ "initial_landing_position": [],
19
+ "initial_landing_distance": [],
20
+ "landing_distances": [],
21
+ "number_of_regressions_in": [],
22
+ }
23
+
24
+
25
+ def get_fix_seq_and_text_block(
26
+ dffix,
27
+ trial,
28
+ x_txt_start=None,
29
+ y_txt_start=None,
30
+ font_face="Courier New",
31
+ font_size=None,
32
+ line_height=None,
33
+ use_corrected_fixations=True,
34
+ correction_algo="warp",
35
+ ):
36
+ if use_corrected_fixations and correction_algo is not None:
37
+ fixations_tuples = [
38
+ (
39
+ (x[1]["x"], x[1][f"y_{correction_algo}"], x[1]["corrected_start_time"], x[1]["corrected_end_time"])
40
+ if x[1]["corrected_start_time"] < x[1]["corrected_end_time"]
41
+ else (x[1]["x"], x[1]["y"], x[1]["corrected_start_time"], x[1]["corrected_end_time"] + 1)
42
+ )
43
+ for x in dffix.iterrows()
44
+ ]
45
+ else:
46
+ fixations_tuples = [
47
+ (
48
+ (x[1]["x"], x[1]["y"], x[1]["corrected_start_time"], x[1]["corrected_end_time"])
49
+ if x[1]["corrected_start_time"] < x[1]["corrected_end_time"]
50
+ else (x[1]["x"], x[1]["y"], x[1]["corrected_start_time"], x[1]["corrected_end_time"] + 1)
51
+ )
52
+ for x in dffix.iterrows()
53
+ ]
54
+
55
+ if "display_coords" in trial:
56
+ display_coords = trial["display_coords"]
57
+ else:
58
+ display_coords = (0, 0, 1920, 1080)
59
+ screen_size = ((display_coords[2] - display_coords[0]), (display_coords[3] - display_coords[1]))
60
+
61
+ try:
62
+ fixation_sequence = ek.FixationSequence(fixations_tuples)
63
+ except Exception as e:
64
+ ic(e)
65
+ ic(f"Creating fixation failed for {trial['trial_id']} {trial['filename']}")
66
+ return None, None, screen_size
67
+
68
+ y_diffs = np.unique(trial["line_heights"])
69
+ if len(y_diffs) == 1:
70
+ y_diff = y_diffs[0]
71
+ else:
72
+ y_diff = np.min(y_diffs)
73
+ chars_list = trial["chars_list"]
74
+ max_line = int(chars_list[-1]["assigned_line"])
75
+ words_on_lines = {x: [] for x in range(int(max_line) + 1)}
76
+ [words_on_lines[x["assigned_line"]].append(x["char"]) for x in chars_list]
77
+ sentence_list = ["".join([s for s in v]) for idx, v in words_on_lines.items()]
78
+
79
+ if x_txt_start is None:
80
+ x_txt_start = float(chars_list[0]["char_xmin"])
81
+ if y_txt_start is None:
82
+ y_txt_start = float(chars_list[0]["char_ymax"])
83
+
84
+ if font_face is None and "font" in trial:
85
+ font_face = trial["font"]
86
+ elif font_face is None:
87
+ font_face = "DejaVu Sans Mono"
88
+
89
+ if font_size is None and "font_size" in trial:
90
+ font_size = trial["font_size"]
91
+ elif font_size is None:
92
+ font_size = float(y_diff * 0.333) # pixel to point conversion
93
+ if line_height is None:
94
+ line_height = float(y_diff)
95
+ textblock_input_dict = dict(
96
+ text=sentence_list,
97
+ position=(float(x_txt_start), float(y_txt_start)),
98
+ font_face=font_face,
99
+ line_height=line_height,
100
+ font_size=font_size,
101
+ anchor="left",
102
+ align="left",
103
+ )
104
+ textblock = ek.TextBlock(**textblock_input_dict)
105
+
106
+ ek.io.save(fixation_sequence, f'results/fixation_sequence_eyekit_{trial["trial_id"]}.json', compress=False)
107
+ ek.io.save(textblock, f'results/textblock_eyekit_{trial["trial_id"]}.json', compress=False)
108
+
109
+ return fixations_tuples, textblock_input_dict, screen_size
110
+
111
+
112
+ def eyekit_plot(fixations_tuples, textblock_input_dict, screen_size):
113
+ textblock = ek.TextBlock(**textblock_input_dict)
114
+ img = ek.vis.Image(*screen_size)
115
+ img.draw_text_block(textblock)
116
+ for word in textblock.words():
117
+ img.draw_rectangle(word, color="hotpink")
118
+ fixation_sequence = ek.FixationSequence(fixations_tuples)
119
+ img.draw_fixation_sequence(fixation_sequence)
120
+ img.save("temp_eyekit_img.png", crop_margin=200)
121
+ img_png = Image.open("temp_eyekit_img.png")
122
+ return img_png
123
+
124
+
125
+ def plot_with_measure(fixations_tuples, textblock_input_dict, screen_size, measure, use_characters=False):
126
+ textblock = ek.TextBlock(**textblock_input_dict)
127
+ fixation_sequence = ek.FixationSequence(fixations_tuples)
128
+
129
+ eyekitplot_img = eyekit_plot(fixations_tuples, textblock_input_dict, screen_size)
130
+ eyekitplot_img = ek.vis.Image(*screen_size)
131
+ eyekitplot_img.draw_text_block(textblock)
132
+ if use_characters:
133
+ measure_results = getattr(ek.measure, measure)(textblock.characters(), fixation_sequence)
134
+ enum = textblock.characters()
135
+ else:
136
+ measure_results = getattr(ek.measure, measure)(textblock.words(), fixation_sequence)
137
+ enum = textblock.words()
138
+ for word in enum:
139
+ eyekitplot_img.draw_rectangle(word, color="lightseagreen")
140
+ x = word.onset
141
+ y = word.y_br - 3
142
+ label = f"{measure_results[word.id]}"
143
+ eyekitplot_img.draw_annotation((x, y), label, color="lightseagreen", font_face="Arial bold", font_size=15)
144
+ eyekitplot_img.draw_fixation_sequence(fixation_sequence, color="gray")
145
+ eyekitplot_img.save("multiline_passage_piccol.png", crop_margin=100)
146
+ img_png = Image.open("multiline_passage_piccol.png")
147
+ return img_png
148
+
149
+
150
+ def get_eyekit_measures(fixations_tuples, textblock_input_dict, trial, get_char_measures=False):
151
+ textblock = ek.TextBlock(**textblock_input_dict)
152
+ fixation_sequence = ek.FixationSequence(fixations_tuples)
153
+ measures = copy.deepcopy(MEASURES_DICT)
154
+ words = []
155
+ for w in textblock.words():
156
+ words.append(w.text)
157
+ for m in measures.keys():
158
+ measures[m].append(getattr(ek.measure, m)(w, fixation_sequence))
159
+ word_measures_df = pd.DataFrame(measures)
160
+ word_measures_df["word_number"] = np.arange(0, len(words))
161
+ word_measures_df["word"] = words
162
+
163
+ first_column = word_measures_df.pop("word")
164
+ word_measures_df.insert(0, "word", first_column)
165
+ first_column = word_measures_df.pop("word_number")
166
+ word_measures_df.insert(0, "word_number", first_column)
167
+
168
+ if "item" in trial and "item" not in word_measures_df.columns:
169
+ word_measures_df.insert(loc=0, column="item", value=trial["item"])
170
+ if "condition" in trial and "condition" not in word_measures_df.columns:
171
+ word_measures_df.insert(loc=0, column="condition", value=trial["condition"])
172
+ if "trial_id" in trial and "trial_id" not in word_measures_df.columns:
173
+ word_measures_df.insert(loc=0, column="trial_id", value=trial["trial_id"])
174
+ if "subject" in trial and "subject" not in word_measures_df.columns:
175
+ word_measures_df.insert(loc=0, column="subject", value=trial["subject"])
176
+ if get_char_measures:
177
+ measures = copy.deepcopy(MEASURES_DICT)
178
+
179
+ characters = []
180
+ for c in textblock.characters():
181
+ characters.append(c.text)
182
+ for m in measures.keys():
183
+ measures[m].append(getattr(ek.measure, m)(c, fixation_sequence))
184
+ character_measures_df = pd.DataFrame(measures)
185
+ character_measures_df["char_number"] = np.arange(0, len(characters))
186
+ character_measures_df["character"] = characters
187
+
188
+ first_column = character_measures_df.pop("character")
189
+ character_measures_df.insert(0, "character", first_column)
190
+ first_column = character_measures_df.pop("char_number")
191
+ character_measures_df.insert(0, "char_number", first_column)
192
+ else:
193
+ character_measures_df = None
194
+ return word_measures_df, character_measures_df
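+ 
+ 
+ # Minimal usage sketch (an illustrative helper, not called anywhere else in this repo): it assumes
+ # `dffix` is a fixation DataFrame with the "x", "y", "corrected_start_time" and
+ # "corrected_end_time" columns used above, and `trial` is a dict containing at least
+ # "chars_list", "line_heights", "trial_id" and "filename".
+ def example_word_measures(dffix, trial):
+     fixations_tuples, textblock_input_dict, _screen_size = get_fix_seq_and_text_block(
+         dffix, trial, use_corrected_fixations=False, correction_algo=None
+     )
+     if fixations_tuples is None:  # FixationSequence creation failed
+         return None
+     # Word-level eyekit measures; character-level measures are skipped here.
+     word_measures_df, _ = get_eyekit_measures(
+         fixations_tuples, textblock_input_dict, trial, get_char_measures=False
+     )
+     return word_measures_df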
fixations_df_columns.md ADDED
@@ -0,0 +1,88 @@
1
+ #### Column names for Fixation Dataframe
2
+ Some features were adapted from the popEye R package ([github](https://github.com/sascha2schroeder/popEye))
3
+ If a column depends on a line assignment, then its name ends with an _ALGORITHM_NAME suffix (see the sketch after this list).
4
+ - subject: Subject name or ID
5
+ - trial_id: Trial ID
6
+ - item: Item ID
7
+ - condition: Condition (if applicable)
8
+ - fixation_number: Index of fixation
9
+ - start_uncorrected: Starting timestamp of event as recorded by EyeLink
10
+ - stop_uncorrected: End timestamp of event as recorded by EyeLink
11
+ - start_time: Start time (in ms since start of the trial)
12
+ - end_time: End time (in ms since start of the trial)
13
+ - corrected_start_time: Start time of the event, measured from the first fixation
14
+ - corrected_end_time: End time of the event, measured from the first fixation
15
+ - x: Raw x position (in pixel)
16
+ - y: Raw y position (in pixel)
17
+ - pupil_size: Size of pupil as recorded by EyeLink
18
+ - distance_in_char_widths: Horizontal distance to previous fixation in number of character widths
19
+ - y_ALGORITHM: Corrected y position (in pixel), i.e. after line assignment
20
+ - y_ALGORITHM_correction: Difference between corrected and raw y position (in pixel)
21
+ - duration: Duration (in ms)
22
+ - sac_in: Incoming saccade length (in letters)
23
+ - sac_out: Outgoing saccade length (in letters)
24
+ - type: Whether fixation is an outlier fixation ("out"), i.e. located outside the text area (see assign.outlier and assign.outlier.dist arguments)
25
+ - blink: Whether a blink occurred directly before or after the fixation
26
+ - run: Number of run the fixation was assigned to (if applicable)
27
+ - linerun: Number of run on the line the fixation was assigned to (if applicable)
28
+ - line_num: Number of line the fixation was assigned to
29
+ - line_change: Difference between the line of the current and the last fixation
30
+ - line_let: Number of letter on line
31
+ - line_word: Number of word on line
32
+ - letternum: Number of letter in trial
33
+ - letter: Name of Letter
34
+ - on_word_number: Number of word in trial
35
+ - on_word: Name of Word
36
+ - ianum: Number of IA in trial
37
+ - ia: Name of IA
38
+ - on_sentence_num: Number of sentence in trial
39
+ - on_sentence: Sentence text
40
+ - sentence_nwords: Number of words in sentence
41
+ - trial: Name of trial (abbreviated)
42
+ - trial_nwords: Number of words in trial
43
+ - word_fix: Number of fixation on word
44
+ - word_run: Number of the run in which the word was read
45
+ - word_runid: Number of the word run the fixation belongs to
46
+ - word_run_fix: Number of fixation within the run
47
+ - word_firstskip: Whether word has been skipped during first-pass reading
48
+ - word_refix: Whether word has been refixated with current fixation
49
+ - word_launch: Launch site distance from the beginning of the word
50
+ - word_land: Landing position with word
51
+ - word_cland: Centered landing position (e.g., calculated from the center of the word)
52
+ - word_reg_out: Whether a regression was made out of the word
53
+ - word_reg_in: Whether a regression was made into the word
54
+ - sentence_word: Number of word in sentence
55
+ - sentence_fix: Number of fixation on sentence
56
+ - sentence_run: Number of run on sentence
57
+ - sentence_runid: Number of the sentence run the fixation belongs to
58
+ - sentence_firstskip: Whether the sentence has been skipped during first-pass reading
59
+ - sentence_refix: Whether sentence was refixated with the current fixation
60
+ - sentence_reg_out: Whether a regression was made out of the sentence
61
+ - sentence_reg_in: Whether a regression was made into the sentence
62
+ - sac_in_ALGORITHM_NAME: Incoming saccade length (in letters)
63
+ - sac_out_ALGORITHM_NAME: Outgoing saccade length (in letters)
64
+ - blink_before: Whether a blink was recorded before the event
65
+ - blink_after: Whether a blink was recorded after the event
66
+ - blink: Whether a blink was recorded before or after the event
67
+ - duration: Duration of the event
68
+ - line_change_ALGORITHM_NAME: Difference between the line of the current and the previous fixation
69
+ - on_word_number_ALGORITHM_NAME: Index of word that the fixation has been assigned to
70
+ - num_words_in_sentence_ALGORITHM_NAME: Number of words in sentence to which fixation has been assigned
71
+ - word_land_ALGORITHM_NAME: Landing position of fixation within word in number of letters
72
+ - line_let_ALGORITHM_NAME: Index of letter on line
73
+ - line_let_from_last_letter_ALGORITHM_NAME: Letter number on line counted from last letter of line
74
+ - line_word_ALGORITHM_NAME: Number of word on line
75
+ - sentence_word_ALGORITHM_NAME: Number of word in sentence
76
+ - is_far_out_of_text_uncorrected: Indicates if a fixation is far outside the stimulus area as determined by the vertical and horizontal margins
77
+ - line_let_previous_ALGORITHM_NAME: Index of letter on line for previous fixations
78
+ - line_let_next_ALGORITHM_NAME: Index of letter on line for next fixations
79
+ - sentence_reg_out_to_ALGORITHM_NAME: Whether a regression was made out of the sentence
80
+ - sentence_reg_in_from_ALGORITHM_NAME: Whether a regression was made into the sentence
81
+ - word_reg_in_from_ALGORITHM_NAME: Whether a regression was made into the word
82
+ - word_reg_out_to_ALGORITHM_NAME: Whether a regression was made out of the word
83
+ - word_firstskip_ALGORITHM_NAME: Whether word has been skipped during first-pass reading
84
+ - sentence_firstskip_ALGORITHM_NAME: Whether the sentence has been skipped during first-pass reading
85
+ - sentence_runid_ALGORITHM_NAME: Number of the sentence run the fixation belongs to
86
+ - sentence_run_fix_ALGORITHM_NAME:
87
+ - angle_incoming: Angle based on position of previous fixation
88
+ - angle_outgoing: Angle based on position of next fixation
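+ 
+ A minimal sketch, assuming the fixation dataframe is loaded as a pandas DataFrame named `fixations_df` and that `"warp"` stands in for one of the line-assignment algorithm names, showing how the _ALGORITHM_NAME suffix can be used to pull out the columns produced by a single algorithm:
+ 
+ ```python
+ algo = "warp"  # hypothetical algorithm name
+ algo_cols = [c for c in fixations_df.columns if c.endswith(f"_{algo}")]
+ shared_cols = ["subject", "trial_id", "fixation_number", "duration", "x", "y"]
+ per_algo_view = fixations_df[shared_cols + algo_cols]
+ ```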
item_df_columns.md ADDED
@@ -0,0 +1,4 @@
1
+ #### Column names for Item Dataframe
2
+ - item: Item ID
3
+ - condition: Condition (if applicable)
4
+ - text: Stimulus text for item
loss_functions.py ADDED
@@ -0,0 +1,97 @@
1
+ import torch as t
2
+
3
+
4
+ def corn_loss(logits, y_train, num_classes):
5
+ """Computes the CORN loss described in our forthcoming
6
+ 'Deep Neural Networks for Rank Consistent Ordinal
7
+ Regression based on Conditional Probabilities'
8
+ manuscript.
9
+ Parameters
10
+ ----------
11
+ logits : torch.tensor, shape=(num_examples, num_classes-1)
12
+ Outputs of the CORN layer.
13
+ y_train : torch.tensor, shape=(num_examples)
14
+ Torch tensor containing the class labels.
15
+ num_classes : int
16
+ Number of unique class labels (class labels should start at 0).
17
+ Returns
18
+ ----------
19
+ loss : torch.tensor
20
+ A torch.tensor containing a single loss value.
21
+ Examples
22
+ ----------
23
+ >>> import torch
24
+ >>> from coral_pytorch.losses import corn_loss
25
+ >>> # Consider 8 training examples
26
+ >>> _ = torch.manual_seed(123)
27
+ >>> X_train = torch.rand(8, 99)
28
+ >>> y_train = torch.tensor([0, 1, 2, 2, 2, 3, 4, 4])
29
+ >>> NUM_CLASSES = 5
30
+ >>> #
31
+ >>> #
32
+ >>> # def __init__(self):
33
+ >>> corn_net = torch.nn.Linear(99, NUM_CLASSES-1)
34
+ >>> #
35
+ >>> #
36
+ >>> # def forward(self, X_train):
37
+ >>> logits = corn_net(X_train)
38
+ >>> logits.shape
39
+ torch.Size([8, 4])
40
+ >>> corn_loss(logits, y_train, NUM_CLASSES)
41
+ tensor(0.7127, grad_fn=<DivBackward0>)
42
+ https://github.com/Raschka-research-group/coral-pytorch/blob/c6ab93afd555a6eac708c95ae1feafa15f91c5aa/coral_pytorch/losses.py
43
+ """
44
+ sets = []
45
+ for i in range(num_classes - 1):
46
+ label_mask = y_train > i - 1
47
+ label_tensor = (y_train[label_mask] > i).to(t.int64)
48
+ sets.append((label_mask, label_tensor))
49
+
50
+ num_examples = 0
51
+ losses = 0.0
52
+ for task_index, s in enumerate(sets):
53
+ train_examples = s[0]
54
+ train_labels = s[1]
55
+
56
+ if len(train_labels) < 1:
57
+ continue
58
+
59
+ num_examples += len(train_labels)
60
+ pred = logits[train_examples, task_index]
61
+
62
+ loss = -t.sum(
63
+ t.nn.functional.logsigmoid(pred) * train_labels
64
+ + (t.nn.functional.logsigmoid(pred) - pred) * (1 - train_labels)
65
+ )
66
+ losses += loss
67
+
68
+ return losses / num_examples
69
+
70
+
71
+ def corn_label_from_logits(logits):
72
+ """
73
+ Returns the predicted rank label from logits for a
74
+ network trained via the CORN loss.
75
+ Parameters
76
+ ----------
77
+ logits : torch.tensor, shape=(n_examples, n_classes)
78
+ Torch tensor consisting of logits returned by the
79
+ neural net.
80
+ Returns
81
+ ----------
82
+ labels : torch.tensor, shape=(n_examples)
83
+ Integer tensor containing the predicted rank (class) labels
84
+ Examples
85
+ ----------
86
+ >>> # 2 training examples, 5 classes
87
+ >>> logits = torch.tensor([[14.152, -6.1942, 0.47710, 0.96850],
88
+ ... [65.667, 0.303, 11.500, -4.524]])
89
+ >>> corn_label_from_logits(logits)
90
+ tensor([1, 3])
91
+ https://github.com/Raschka-research-group/coral-pytorch/blob/c6ab93afd555a6eac708c95ae1feafa15f91c5aa/coral_pytorch/dataset.py
92
+ """
93
+ probas = t.sigmoid(logits)
94
+ probas = t.cumprod(probas, dim=1)
95
+ predict_levels = probas > 0.5
96
+ predicted_labels = t.sum(predict_levels, dim=1)
97
+ return predicted_labels
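+ 
+ 
+ # Minimal end-to-end sketch combining the two helpers above (illustrative only, not used
+ # elsewhere in this repo); shapes mirror the docstring example: 8 examples, 5 ordinal classes.
+ def _corn_example():
+     t.manual_seed(123)
+     num_classes = 5
+     corn_net = t.nn.Linear(99, num_classes - 1)  # CORN head outputs num_classes - 1 logits
+     x_train = t.rand(8, 99)
+     y_train = t.tensor([0, 1, 2, 2, 2, 3, 4, 4])
+     logits = corn_net(x_train)
+     loss = corn_loss(logits, y_train, num_classes)  # scalar training loss
+     preds = corn_label_from_logits(logits)  # predicted rank labels, shape (8,)
+     return loss, preds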
models.py ADDED
@@ -0,0 +1,892 @@
1
+ import timm
2
+ import os
3
+ from typing import Any
4
+ from pytorch_lightning.utilities.types import LRSchedulerTypeUnion
5
+ import torch as t
6
+ from torch import nn
7
+ import transformers
8
+ import pytorch_lightning as plight
9
+ import torchmetrics
10
+ import einops as eo
11
+ from loss_functions import corn_loss, corn_label_from_logits
12
+
13
+ t.set_float32_matmul_precision("medium")
14
+ global_settings = dict(try_using_torch_compile=False)
15
+
16
+
17
+ class EnsembleModel(plight.LightningModule):
18
+ def __init__(self, models_without_norm_df, models_with_norm_df, learning_rate=0.0002, use_simple_average=False):
19
+ super().__init__()
20
+ self.models_without_norm = nn.ModuleList(list(models_without_norm_df))
21
+ self.models_with_norm = nn.ModuleList(list(models_with_norm_df))
22
+ self.learning_rate = learning_rate
23
+ self.use_simple_average = use_simple_average
24
+
25
+ if not self.use_simple_average:
26
+ self.combiner = nn.Linear(
27
+ self.models_with_norm[0].num_classes * (len(self.models_with_norm) + len(self.models_without_norm)),
28
+ self.models_with_norm[0].num_classes,
29
+ )
30
+
31
+ def forward(self, x):
32
+ x_unnormed, x_normed = x
33
+ if not self.use_simple_average:
34
+ out_unnormed = t.cat([model.model_step(x_unnormed, 0)[0] for model in self.models_without_norm], dim=-1)
35
+ out_normed = t.cat([model.model_step(x_normed, 0)[0] for model in self.models_with_norm], dim=-1)
36
+ out_avg = self.combiner(t.cat((out_unnormed, out_normed), dim=-1))
37
+ else:
38
+ out_unnormed = [model.model_step(x_unnormed, 0)[0] for model in self.models_without_norm]
39
+ out_normed = [model.model_step(x_normed, 0)[0] for model in self.models_with_norm]
40
+
41
+ out_avg = (t.stack(out_unnormed + out_normed, dim=-1) / 2).mean(-1)
42
+ return {"out_avg": out_avg, "out_unnormed": out_unnormed, "out_normed": out_normed}, x_unnormed[-1]
43
+
44
+ def training_step(self, batch, batch_idx):
45
+ out, y = self(batch)
46
+ loss = self.models_with_norm[0]._get_loss(out["out_avg"], y, batch[0])
47
+ self.log("train_loss", loss, on_epoch=True, on_step=True, sync_dist=True)
48
+ return loss
49
+
50
+ def validation_step(self, batch, batch_idx):
51
+ out, y = self(batch)
52
+ preds, y_onecold, ignore_index_val = self.models_with_norm[0]._get_preds_reals(out["out_avg"], y)
53
+ acc = torchmetrics.functional.accuracy(
54
+ preds,
55
+ y_onecold.to(t.long),
56
+ ignore_index=ignore_index_val,
57
+ num_classes=self.models_with_norm[0].num_classes,
58
+ task="multiclass",
59
+ )
60
+ self.log("acc", acc * 100, prog_bar=True, sync_dist=True)
61
+ loss = self.models_with_norm[0]._get_loss(out["out_avg"], y, batch[0])
62
+ self.log("val_loss", loss, prog_bar=True, sync_dist=True)
63
+ return loss
64
+
65
+ def predict_step(self, batch, batch_idx: int, dataloader_idx: int = 0):
66
+ out, y = self(batch)
67
+ preds, y_onecold, ignore_index_val = self.models_with_norm[0]._get_preds_reals(out["out_avg"], y)
68
+ return preds, out, y_onecold
69
+
70
+ def configure_optimizers(self):
71
+ return t.optim.Adam(self.parameters(), lr=self.learning_rate)
72
+
73
+
74
+ class TimmHeadReplace(nn.Module):
75
+ def __init__(self, pooling=None, in_channels=512, pooling_output_dimension=1, all_identity=False) -> None:
76
+ super().__init__()
77
+
78
+ if all_identity:
79
+ self.head = nn.Identity()
80
+ self.pooling = None
81
+ else:
82
+ self.pooling = pooling
83
+ if pooling is not None:
84
+ self.pooling_output_dimension = pooling_output_dimension
85
+ if self.pooling == "AdaptiveAvgPool2d":
86
+ self.pooling_layer = nn.AdaptiveAvgPool2d(pooling_output_dimension)
87
+ elif self.pooling == "AdaptiveMaxPool2d":
88
+ self.pooling_layer = nn.AdaptiveMaxPool2d(pooling_output_dimension)
89
+ self.head = nn.Flatten()
90
+
91
+ def forward(self, x, pre_logits=False):
92
+ if self.pooling is not None:
93
+ if self.pooling == "stack_avg_max_attn":
94
+ x = t.cat([layer(x) for layer in self.pooling_layer], dim=-1)
95
+ else:
96
+ x = self.pooling_layer(x)
97
+ return self.head(x)
98
+
99
+
100
+ class CVModel(nn.Module):
101
+ def __init__(
102
+ self,
103
+ modelname,
104
+ in_shape,
105
+ num_classes,
106
+ loss_func,
107
+ last_activation: str,
108
+ input_padding_val=10,
109
+ char_dims=2,
110
+ max_seq_length=1000,
111
+ ) -> None:
112
+ super().__init__()
113
+ self.modelname = modelname
114
+ self.loss_func = loss_func
115
+ self.in_shape = in_shape
116
+ self.char_dims = char_dims
117
+ self.x_shape = in_shape
118
+ self.last_activation = last_activation
119
+ self.max_seq_length = max_seq_length
120
+ self.num_classes = num_classes
121
+ if self.loss_func == "OrdinalRegLoss":
122
+ self.out_shape = 1
123
+ else:
124
+ self.out_shape = num_classes
125
+
126
+ self.cv_model = timm.create_model(modelname, pretrained=True, num_classes=0)
127
+ self.cv_model.classifier = nn.Identity()
128
+ with t.inference_mode():
129
+ test_out = self.cv_model(t.ones(self.in_shape, dtype=t.float32))
130
+ self.cv_model_out_dim = test_out.shape[1]
131
+ self.cv_model.classifier = nn.Sequential(nn.Flatten(), nn.Linear(self.cv_model_out_dim, self.max_seq_length))
132
+ if self.out_shape == 1:
133
+ self.logit_norm = nn.Identity()
134
+ self.out_project = nn.Identity()
135
+ else:
136
+ self.logit_norm = nn.LayerNorm(self.max_seq_length)
137
+ self.out_project = nn.Linear(1, self.out_shape)
138
+
139
+ if last_activation == "Softmax":
140
+ self.final_activation = nn.Softmax(dim=-1)
141
+ elif last_activation == "Sigmoid":
142
+ self.final_activation = nn.Sigmoid()
143
+ elif last_activation == "LogSigmoid":
144
+ self.final_activation = nn.LogSigmoid()
145
+ elif last_activation == "Identity":
146
+ self.final_activation = nn.Identity()
147
+ else:
148
+ raise NotImplementedError(f"{last_activation} not implemented")
149
+
150
+ def forward(self, x):
151
+ if isinstance(x, list):
152
+ x = x[0]
153
+ x = self.cv_model(x)
154
+ x = self.cv_model.classifier(x).unsqueeze(-1)
155
+ x = self.out_project(x)
156
+ return self.final_activation(x)
157
+
158
+
159
+ class LitModel(plight.LightningModule):
160
+ def __init__(
161
+ self,
162
+ in_shape: tuple,
163
+ hidden_dim: int,
164
+ num_attention_heads: int,
165
+ num_layers: int,
166
+ loss_func: str,
167
+ learning_rate: float,
168
+ weight_decay: float,
169
+ cfg: dict,
170
+ use_lr_warmup: bool,
171
+ use_reduce_on_plateau: bool,
172
+ track_gradient_histogram=False,
173
+ register_forw_hook=False,
174
+ char_dims=2,
175
+ ) -> None:
176
+ super().__init__()
177
+ if "only_use_2nd_input_stream" not in cfg:
178
+ cfg["only_use_2nd_input_stream"] = False
179
+
180
+ if "gamma_step_size" not in cfg:
181
+ cfg["gamma_step_size"] = 5
182
+ if "gamma_step_factor" not in cfg:
183
+ cfg["gamma_step_factor"] = 0.5
184
+ self.save_hyperparameters(
185
+ dict(
186
+ in_shape=in_shape,
187
+ hidden_dim=hidden_dim,
188
+ num_attention_heads=num_attention_heads,
189
+ num_layers=num_layers,
190
+ loss_func=loss_func,
191
+ learning_rate=learning_rate,
192
+ cfg=cfg,
193
+ x_shape=in_shape,
194
+ num_classes=cfg["num_classes"],
195
+ use_lr_warmup=use_lr_warmup,
196
+ num_warmup_steps=cfg["num_warmup_steps"],
197
+ use_reduce_on_plateau=use_reduce_on_plateau,
198
+ weight_decay=weight_decay,
199
+ track_gradient_histogram=track_gradient_histogram,
200
+ register_forw_hook=register_forw_hook,
201
+ char_dims=char_dims,
202
+ remove_timm_classifier_head_pooling=cfg["remove_timm_classifier_head_pooling"],
203
+ change_pooling_for_timm_head_to=cfg["change_pooling_for_timm_head_to"],
204
+ chars_conv_pooling_out_dim=cfg["chars_conv_pooling_out_dim"],
205
+ )
206
+ )
207
+ self.model_to_use = cfg["model_to_use"]
208
+ self.num_classes = cfg["num_classes"]
209
+ self.x_shape = in_shape
210
+ self.in_shape = in_shape
211
+ self.hidden_dim = hidden_dim
212
+ self.num_attention_heads = num_attention_heads
213
+ self.num_layers = num_layers
214
+
215
+ self.use_lr_warmup = use_lr_warmup
216
+ self.num_warmup_steps = cfg["num_warmup_steps"]
217
+ self.warmup_exponent = cfg["warmup_exponent"]
218
+
219
+ self.use_reduce_on_plateau = use_reduce_on_plateau
220
+ self.loss_func = loss_func
221
+ self.learning_rate = learning_rate
222
+ self.weight_decay = weight_decay
223
+ self.using_one_hot_targets = cfg["one_hot_y"]
224
+ self.track_gradient_histogram = track_gradient_histogram
225
+ self.register_forw_hook = register_forw_hook
226
+ if self.loss_func == "OrdinalRegLoss":
227
+ self.ord_reg_loss_max = cfg["ord_reg_loss_max"]
228
+ self.ord_reg_loss_min = cfg["ord_reg_loss_min"]
229
+
230
+ self.num_lin_layers = cfg["num_lin_layers"]
231
+ self.linear_activation = cfg["linear_activation"]
232
+ self.last_activation = cfg["last_activation"]
233
+
234
+ self.max_seq_length = cfg["manual_max_sequence_for_model"]
235
+
236
+ self.use_char_embed_info = cfg["use_embedded_char_pos_info"]
237
+
238
+ self.method_chars_into_model = cfg["method_chars_into_model"]
239
+ self.source_for_pretrained_cv_model = cfg["source_for_pretrained_cv_model"]
240
+ self.method_to_include_char_positions = cfg["method_to_include_char_positions"]
241
+
242
+ self.char_dims = char_dims
243
+ self.char_sequence_length = cfg["max_len_chars_list"] if self.use_char_embed_info else 0
244
+
245
+ self.chars_conv_lr_reduction_factor = cfg["chars_conv_lr_reduction_factor"]
246
+ if self.use_char_embed_info:
247
+ self.chars_bert_reduction_factor = cfg["chars_bert_reduction_factor"]
248
+
249
+ self.use_in_projection_bias = cfg["use_in_projection_bias"]
250
+ self.add_layer_norm_to_in_projection = cfg["add_layer_norm_to_in_projection"]
251
+
252
+ self.hidden_dropout_prob = cfg["hidden_dropout_prob"]
253
+ self.layer_norm_after_in_projection = cfg["layer_norm_after_in_projection"]
254
+ self.method_chars_into_model = cfg["method_chars_into_model"]
255
+ self.input_padding_val = cfg["input_padding_val"]
256
+ self.cv_char_modelname = cfg["cv_char_modelname"]
257
+ self.char_plot_shape = cfg["char_plot_shape"]
258
+
259
+ self.remove_timm_classifier_head_pooling = cfg["remove_timm_classifier_head_pooling"]
260
+ self.change_pooling_for_timm_head_to = cfg["change_pooling_for_timm_head_to"]
261
+ self.chars_conv_pooling_out_dim = cfg["chars_conv_pooling_out_dim"]
262
+
263
+ self.add_layer_norm_to_char_mlp = cfg["add_layer_norm_to_char_mlp"]
264
+ if "profile_torch_run" in cfg:
265
+ self.profile_torch_run = cfg["profile_torch_run"]
266
+ else:
267
+ self.profile_torch_run = False
268
+ if self.loss_func == "OrdinalRegLoss":
269
+ self.out_shape = 1
270
+ else:
271
+ self.out_shape = cfg["num_classes"]
272
+
273
+ if not self.hparams.cfg["only_use_2nd_input_stream"]:
274
+ if (
275
+ self.method_chars_into_model == "dense"
276
+ and self.use_char_embed_info
277
+ and self.method_to_include_char_positions == "concat"
278
+ ):
279
+ self.project = nn.Linear(self.x_shape[-1], self.hidden_dim // 2, bias=self.use_in_projection_bias)
280
+ elif (
281
+ self.method_chars_into_model == "bert"
282
+ and self.use_char_embed_info
283
+ and self.method_to_include_char_positions == "concat"
284
+ ):
285
+ self.hidden_dim_chars = self.hidden_dim // 2
286
+ self.project = nn.Linear(self.x_shape[-1], self.hidden_dim_chars, bias=self.use_in_projection_bias)
287
+ elif (
288
+ self.method_chars_into_model == "resnet"
289
+ and self.method_to_include_char_positions == "concat"
290
+ and self.use_char_embed_info
291
+ ):
292
+ self.project = nn.Linear(self.x_shape[-1], self.hidden_dim // 2, bias=self.use_in_projection_bias)
293
+ elif self.model_to_use == "cv_only_model":
294
+ self.project = nn.Identity()
295
+ else:
296
+ self.project = nn.Linear(self.x_shape[-1], self.hidden_dim, bias=self.use_in_projection_bias)
297
+ if self.add_layer_norm_to_in_projection:
298
+ self.project = nn.Sequential(
299
+ nn.Linear(self.project.in_features, self.project.out_features, bias=self.use_in_projection_bias),
300
+ nn.LayerNorm(self.project.out_features),
301
+ )
302
+
303
+ if hasattr(self, "project") and "posix" in os.name and global_settings["try_using_torch_compile"]:
304
+ self.project = t.compile(self.project)
305
+
306
+ if self.use_char_embed_info:
307
+ self._create_char_model()
308
+
309
+ if self.layer_norm_after_in_projection:
310
+ if self.hparams.cfg["only_use_2nd_input_stream"]:
311
+ self.layer_norm_in = nn.LayerNorm(self.hidden_dim // 2)
312
+ else:
313
+ self.layer_norm_in = nn.LayerNorm(self.hidden_dim)
314
+
315
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
316
+ self.layer_norm_in = t.compile(self.layer_norm_in)
317
+
318
+ self._create_main_seq_model(cfg)
319
+
320
+ if register_forw_hook:
321
+ self.register_hooks()
322
+ if self.hparams.cfg["only_use_2nd_input_stream"]:
323
+ linear_in_dim = self.hidden_dim // 2
324
+ else:
325
+ linear_in_dim = self.hidden_dim
326
+
327
+ if self.num_lin_layers == 1:
328
+ self.linear = nn.Linear(linear_in_dim, self.out_shape)
329
+ else:
330
+ lin_layers = []
331
+ for _ in range(self.num_lin_layers - 1):
332
+ lin_layers.extend(
333
+ [
334
+ nn.Linear(linear_in_dim, linear_in_dim),
335
+ getattr(nn, self.linear_activation)(),
336
+ ]
337
+ )
338
+ self.linear = nn.Sequential(*lin_layers, nn.Linear(linear_in_dim, self.out_shape))
339
+
340
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
341
+ self.linear = t.compile(self.linear)
342
+
343
+ if self.last_activation == "Softmax":
344
+ self.final_activation = nn.Softmax(dim=-1)
345
+ elif self.last_activation == "Sigmoid":
346
+ self.final_activation = nn.Sigmoid()
347
+ elif self.last_activation == "Identity":
348
+ self.final_activation = nn.Identity()
349
+ else:
350
+ raise NotImplementedError(f"{self.last_activation} not implemented")
351
+
352
+ if self.profile_torch_run:
353
+ self.profilerr = t.profiler.profile(
354
+ schedule=t.profiler.schedule(wait=1, warmup=10, active=10, repeat=1),
355
+ on_trace_ready=t.profiler.tensorboard_trace_handler("tblogs"),
356
+ with_stack=True,
357
+ record_shapes=True,
358
+ profile_memory=False,
359
+ )
360
+
361
+ def _create_main_seq_model(self, cfg):
362
+ if self.hparams.cfg["only_use_2nd_input_stream"]:
363
+ hidden_dim = self.hidden_dim // 2
364
+ else:
365
+ hidden_dim = self.hidden_dim
366
+ if self.model_to_use == "BERT":
367
+ self.bert_config = transformers.BertConfig(
368
+ vocab_size=self.x_shape[-1],
369
+ hidden_size=hidden_dim,
370
+ num_hidden_layers=self.num_layers,
371
+ intermediate_size=hidden_dim,
372
+ num_attention_heads=self.num_attention_heads,
373
+ max_position_embeddings=self.max_seq_length,
374
+ )
375
+ self.bert_model = transformers.BertModel(self.bert_config)
376
+ elif self.model_to_use == "cv_only_model":
377
+ self.bert_model = CVModel(
378
+ modelname=cfg["cv_modelname"],
379
+ in_shape=self.in_shape,
380
+ num_classes=cfg["num_classes"],
381
+ loss_func=cfg["loss_function"],
382
+ last_activation=cfg["last_activation"],
383
+ input_padding_val=cfg["input_padding_val"],
384
+ char_dims=self.char_dims,
385
+ max_seq_length=cfg["manual_max_sequence_for_model"],
386
+ )
387
+ else:
388
+ raise NotImplementedError(f"{self.model_to_use} not implemented")
389
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
390
+ self.bert_model = t.compile(self.bert_model)
391
+ return 0
392
+
393
+ def _create_char_model(self):
394
+ if self.method_chars_into_model == "dense":
395
+ self.chars_project_0 = nn.Linear(self.char_dims, 1, bias=self.use_in_projection_bias)
396
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
397
+ self.chars_project_0 = t.compile(self.chars_project_0)
398
+ if self.method_to_include_char_positions == "concat":
399
+ self.chars_project_1 = nn.Linear(
400
+ self.char_sequence_length, self.hidden_dim // 2, bias=self.use_in_projection_bias
401
+ )
402
+ else:
403
+ self.chars_project_1 = nn.Linear(
404
+ self.char_sequence_length, self.hidden_dim, bias=self.use_in_projection_bias
405
+ )
406
+
407
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
408
+ self.chars_project_1 = t.compile(self.chars_project_1)
409
+ elif not self.method_chars_into_model == "resnet":
410
+ self.chars_project = nn.Linear(self.char_dims, self.hidden_dim_chars, bias=self.use_in_projection_bias)
411
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
412
+ self.chars_project = t.compile(self.chars_project)
413
+
414
+ if self.method_chars_into_model == "bert":
415
+ if not hasattr(self, "hidden_dim_chars"):
416
+ if self.hidden_dim // self.chars_bert_reduction_factor > 1:
417
+ self.hidden_dim_chars = self.hidden_dim // self.chars_bert_reduction_factor
418
+ else:
419
+ self.hidden_dim_chars = self.hidden_dim
420
+ self.num_attention_heads_chars = self.hidden_dim_chars // (self.hidden_dim // self.num_attention_heads)
421
+ self.chars_bert_config = transformers.BertConfig(
422
+ vocab_size=self.x_shape[-1],
423
+ hidden_size=self.hidden_dim_chars,
424
+ num_hidden_layers=self.num_layers,
425
+ intermediate_size=self.hidden_dim_chars,
426
+ num_attention_heads=self.num_attention_heads_chars,
427
+ max_position_embeddings=self.char_sequence_length + 1,
428
+ num_labels=1,
429
+ )
430
+ self.chars_bert = transformers.BertForSequenceClassification(self.chars_bert_config)
431
+
432
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
433
+ self.chars_bert = t.compile(self.chars_bert)
434
+ self.chars_project_class_output = nn.Linear(1, self.hidden_dim_chars, bias=self.use_in_projection_bias)
435
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
436
+ self.chars_project_class_output = t.compile(self.chars_project_class_output)
437
+ elif self.method_chars_into_model == "resnet":
438
+ if self.source_for_pretrained_cv_model == "timm":
439
+ self.chars_conv = timm.create_model(
440
+ self.cv_char_modelname,
441
+ pretrained=True,
442
+ num_classes=0, # remove classifier nn.Linear
443
+ )
444
+ if self.remove_timm_classifier_head_pooling:
445
+ self.chars_conv.head = TimmHeadReplace(all_identity=True)
446
+ with t.inference_mode():
447
+ test_out = self.chars_conv(
448
+ t.ones((1, 3, self.char_plot_shape[0], self.char_plot_shape[1]), dtype=t.float32)
449
+ )
450
+ if test_out.ndim > 3:
451
+ self.chars_conv.head = TimmHeadReplace(
452
+ self.change_pooling_for_timm_head_to,
453
+ test_out.shape[1],
454
+ )
455
+ elif self.source_for_pretrained_cv_model == "huggingface":
456
+ self.chars_conv = transformers.AutoModelForImageClassification.from_pretrained(self.cv_char_modelname)
457
+ elif self.source_for_pretrained_cv_model == "torch_hub":
458
+ self.chars_conv = t.hub.load(*self.cv_char_modelname.split(","))
459
+
460
+ if hasattr(self.chars_conv, "classifier"):
461
+ self.chars_conv.classifier = nn.Identity()
462
+ elif hasattr(self.chars_conv, "cls_classifier"):
463
+ self.chars_conv.cls_classifier = nn.Identity()
464
+ elif hasattr(self.chars_conv, "fc"):
465
+ self.chars_conv.fc = nn.Identity()
466
+
467
+ if hasattr(self.chars_conv, "distillation_classifier"):
468
+ self.chars_conv.distillation_classifier = nn.Identity()
469
+ with t.inference_mode():
470
+ test_out = self.chars_conv(
471
+ t.ones((1, 3, self.char_plot_shape[0], self.char_plot_shape[1]), dtype=t.float32)
472
+ )
473
+ if hasattr(test_out, "last_hidden_state"):
474
+ self.chars_conv_out_dim = test_out.last_hidden_state.shape[1]
475
+ elif hasattr(test_out, "logits"):
476
+ self.chars_conv_out_dim = test_out.logits.shape[1]
477
+ elif isinstance(test_out, list):
478
+ self.chars_conv_out_dim = test_out[0].shape[1]
479
+ else:
480
+ self.chars_conv_out_dim = test_out.shape[1]
481
+
482
+ char_lin_layers = [nn.Flatten(), nn.Linear(self.chars_conv_out_dim, self.hidden_dim // 2)]
483
+ if self.add_layer_norm_to_char_mlp:
484
+ char_lin_layers.append(nn.LayerNorm(self.hidden_dim // 2))
485
+ self.chars_classifier = nn.Sequential(*char_lin_layers)
486
+ if hasattr(self.chars_conv, "distillation_classifier"):
487
+ self.chars_conv.distillation_classifier = nn.Sequential(
488
+ nn.Flatten(), nn.Linear(self.chars_conv_out_dim, self.hidden_dim // 2)
489
+ )
490
+
491
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
492
+ self.chars_classifier = t.compile(self.chars_classifier)
493
+ if "posix" in os.name and global_settings["try_using_torch_compile"]:
494
+ self.chars_conv = t.compile(self.chars_conv)
495
+ return 0
496
+
497
+ def register_hooks(self):
498
+ def add_to_tb(layer):
499
+ def hook(model, input, output):
500
+ if hasattr(output, "detach"):
501
+ for logger in self.loggers:
502
+ if hasattr(logger.experiment, "add_histogram"):
503
+ logger.experiment.add_histogram(
504
+ tag=f"{layer}_{str(list(output.shape))}",
505
+ values=output.detach(),
506
+ global_step=self.trainer.global_step,
507
+ )
508
+
509
+ return hook
510
+
511
+ for layer_id, layer in dict([*self.named_modules()]).items():
512
+ layer.register_forward_hook(add_to_tb(f"act_{layer_id}"))
513
+
514
+ def on_after_backward(self) -> None:
515
+ if self.track_gradient_histogram:
516
+ if self.trainer.global_step % 200 == 0:
517
+ for logger in self.loggers:
518
+ if hasattr(logger.experiment, "add_histogram"):
519
+ for layer_id, layer in dict([*self.named_modules()]).items():
520
+ parameters = layer.parameters()
521
+ for idx2, p in enumerate(parameters):
522
+ grad_val = p.grad
523
+ if grad_val is not None:
524
+ grad_name = f"grad_{idx2}_{layer_id}_{str(list(p.grad.shape))}"
525
+ logger.experiment.add_histogram(
526
+ tag=grad_name, values=grad_val, global_step=self.trainer.global_step
527
+ )
528
+
529
+ return super().on_after_backward()
530
+
531
+ def _fold_in_seq_dim(self, out, y):
532
+ batch_size, seq_len, num_classes = out.shape
533
+ out = eo.rearrange(out, "b s c -> (b s) c", s=seq_len)
534
+ if y is None:
535
+ return out, None
536
+ if len(y.shape) > 2:
537
+ y = eo.rearrange(y, "b s c -> (b s) c", s=seq_len)
538
+ else:
539
+ y = eo.rearrange(y, "b s -> (b s)", s=seq_len)
540
+ return out, y
541
+
542
+ def _get_loss(self, out, y, batch):
543
+ attention_mask = batch[-2]
544
+ if self.loss_func == "BCELoss":
545
+ if self.last_activation == "Identity":
546
+ loss = t.nn.functional.binary_cross_entropy_with_logits(out, y, reduction="none")
547
+ else:
548
+ loss = t.nn.functional.binary_cross_entropy(out, y, reduction="none")
549
+
550
+ replace_tensor = t.zeros(loss[1, 1, :].shape, device=loss.device, dtype=loss.dtype, requires_grad=False)
551
+ loss[~attention_mask.bool()] = replace_tensor
552
+ loss = loss.mean()
553
+ elif self.loss_func == "CrossEntropyLoss":
554
+ if len(out.shape) > 2:
555
+ out, y = self._fold_in_seq_dim(out, y)
556
+ loss = t.nn.functional.cross_entropy(out, y, reduction="mean", ignore_index=-100)
557
+ else:
558
+ loss = t.nn.functional.cross_entropy(out, y, reduction="mean", ignore_index=-100)
559
+
560
+ elif self.loss_func == "OrdinalRegLoss":
561
+ loss = t.nn.functional.mse_loss(out, y, reduction="none")
562
+ loss = loss[attention_mask.bool()].sum() * 10.0 / attention_mask.sum()
563
+ elif self.loss_func == "corn_loss":
564
+ out, y = self._fold_in_seq_dim(out, y)
565
+ loss = corn_loss(out, y.squeeze(), self.out_shape)
566
+ else:
567
+ raise ValueError("Loss Function not reckognized")
568
+ return loss
569
+
570
+ def training_step(self, batch, batch_idx):
571
+ if self.profile_torch_run:
572
+ self.profilerr.step()
573
+ out, y = self.model_step(batch, batch_idx)
574
+ loss = self._get_loss(out, y, batch)
575
+ self.log("train_loss", loss, on_epoch=True, on_step=True, sync_dist=True)
576
+ return loss
577
+
578
+ def forward(*args):
579
+ return forward(args[0], args[1:])
580
+
581
+ def model_step(self, batch, batch_idx):
582
+ out = self.forward(batch)
583
+ return out, batch[-1]
584
+
585
+ def optimizer_step(
586
+ self,
587
+ epoch,
588
+ batch_idx,
589
+ optimizer,
590
+ optimizer_closure,
591
+ ):
592
+ optimizer.step(closure=optimizer_closure)
593
+
594
+ if self.use_lr_warmup and self.hparams["cfg"]["lr_scheduling"] != "OneCycleLR":
595
+ if self.trainer.global_step < self.num_warmup_steps:
596
+ lr_scale = min(1.0, float(self.trainer.global_step + 1) / self.num_warmup_steps) ** self.warmup_exponent
597
+ for pg in optimizer.param_groups:
598
+ pg["lr"] = lr_scale * self.hparams.learning_rate
599
+ if self.trainer.global_step % 10 == 0 or self.trainer.global_step == 0:
600
+ for idx, pg in enumerate(optimizer.param_groups):
601
+ self.log(f"lr_{idx}", pg["lr"], prog_bar=True, sync_dist=True)
602
+
603
+ def lr_scheduler_step(self, scheduler: LRSchedulerTypeUnion, metric: Any | None) -> None:
604
+ if self.use_lr_warmup and self.hparams["cfg"]["lr_scheduling"] != "OneCycleLR":
605
+ if self.trainer.global_step > self.num_warmup_steps:
606
+ if metric is None:
607
+ scheduler.step()
608
+ else:
609
+ scheduler.step(metric)
610
+ else:
611
+ if metric is None:
612
+ scheduler.step()
613
+ else:
614
+ scheduler.step(metric)
615
+
616
+ def _get_preds_reals(self, out, y):
617
+ if self.loss_func == "corn_loss":
618
+ seq_len = out.shape[1]
619
+ out, y = self._fold_in_seq_dim(out, y)
620
+ preds = corn_label_from_logits(out)
621
+ preds = eo.rearrange(preds, "(b s) -> b s", s=seq_len)
622
+ if y is not None:
623
+ y = eo.rearrange(y.squeeze(), "(b s) -> b s", s=seq_len)
624
+
625
+ elif self.loss_func == "OrdinalRegLoss":
626
+ preds = out * (self.ord_reg_loss_max - self.ord_reg_loss_min)
627
+ preds = (preds + self.ord_reg_loss_min).round().to(t.long)
628
+
629
+ else:
630
+ preds = t.argmax(out, dim=-1)
631
+ if y is None:
632
+ return preds, y, -100
633
+ else:
634
+ if self.using_one_hot_targets:
635
+ y_onecold = t.argmax(y, dim=-1)
636
+ ignore_index_val = 0
637
+ elif self.loss_func == "OrdinalRegLoss":
638
+ y_onecold = (y * self.num_classes).round().to(t.long)
639
+
640
+ y_onecold = y * (self.ord_reg_loss_max - self.ord_reg_loss_min)
641
+ y_onecold = (y_onecold + self.ord_reg_loss_min).round().to(t.long)
642
+ ignore_index_val = t.min(y_onecold).to(t.long)
643
+ else:
644
+ y_onecold = y
645
+ ignore_index_val = -100
646
+
647
+ if len(preds.shape) > len(y_onecold.shape):
648
+ preds = preds.squeeze()
649
+ return preds, y_onecold, ignore_index_val
650
+
651
+ def validation_step(self, batch, batch_idx):
652
+ out, y = self.model_step(batch, batch_idx)
653
+ preds, y_onecold, ignore_index_val = self._get_preds_reals(out, y)
654
+
655
+ if self.loss_func == "OrdinalRegLoss":
656
+ y_onecold = y_onecold.flatten()
657
+ preds = preds.flatten()[y_onecold != ignore_index_val]
658
+ y_onecold = y_onecold[y_onecold != ignore_index_val]
659
+ acc = (preds == y_onecold).sum() / len(y_onecold)
660
+ else:
661
+ acc = torchmetrics.functional.accuracy(
662
+ preds,
663
+ y_onecold.to(t.long),
664
+ ignore_index=ignore_index_val,
665
+ num_classes=self.num_classes,
666
+ task="multiclass",
667
+ )
668
+ self.log("acc", acc * 100, prog_bar=True, sync_dist=True)
669
+ loss = self._get_loss(out, y, batch)
670
+ self.log("val_loss", loss, prog_bar=True, sync_dist=True)
671
+
672
+ return loss
673
+
674
+ def predict_step(self, batch, batch_idx):
675
+ out, y = self.model_step(batch, batch_idx)
676
+ preds, y_onecold, ignore_index_val = self._get_preds_reals(out, y)
677
+ return preds, y_onecold
678
+
679
+ def configure_optimizers(self):
680
+ params = list(self.named_parameters())
681
+
682
+ def is_chars_conv(n):
683
+ if "chars_conv" not in n:
684
+ return False
685
+ if "chars_conv" in n and "classifier" in n:
686
+ return False
687
+ else:
688
+ return True
689
+
690
+ grouped_parameters = [
691
+ {
692
+ "params": [p for n, p in params if is_chars_conv(n)],
693
+ "lr": self.learning_rate / self.chars_conv_lr_reduction_factor,
694
+ "weight_decay": self.weight_decay,
695
+ },
696
+ {
697
+ "params": [p for n, p in params if not is_chars_conv(n)],
698
+ "lr": self.learning_rate,
699
+ "weight_decay": self.weight_decay,
700
+ },
701
+ ]
702
+ opti = t.optim.AdamW(grouped_parameters, lr=self.learning_rate, weight_decay=self.weight_decay)
703
+ if self.use_reduce_on_plateau:
704
+ opti_dict = {
705
+ "optimizer": opti,
706
+ "lr_scheduler": {
707
+ "scheduler": t.optim.lr_scheduler.ReduceLROnPlateau(opti, mode="min", patience=2, factor=0.5),
708
+ "monitor": "val_loss",
709
+ "frequency": 1,
710
+ "interval": "epoch",
711
+ },
712
+ }
713
+ return opti_dict
714
+ else:
715
+ cfg = self.hparams["cfg"]
716
+ if cfg["use_reduce_on_plateau"]:
717
+ scheduler = None
718
+ elif cfg["lr_scheduling"] == "multistep":
719
+ scheduler = t.optim.lr_scheduler.MultiStepLR(
720
+ opti, milestones=cfg["multistep_milestones"], gamma=cfg["gamma_multistep"], verbose=False
721
+ )
722
+ interval = "step" if cfg["use_training_steps_for_end_and_lr_decay"] else "epoch"
723
+ elif cfg["lr_scheduling"] == "StepLR":
724
+ scheduler = t.optim.lr_scheduler.StepLR(
725
+ opti, step_size=cfg["gamma_step_size"], gamma=cfg["gamma_step_factor"]
726
+ )
727
+ interval = "step" if cfg["use_training_steps_for_end_and_lr_decay"] else "epoch"
728
+ elif cfg["lr_scheduling"] == "anneal":
729
+ scheduler = t.optim.lr_scheduler.CosineAnnealingLR(
730
+ opti, 250, eta_min=cfg["min_lr_anneal"], last_epoch=-1, verbose=False
731
+ )
732
+ interval = "step"
733
+ elif cfg["lr_scheduling"] == "ExponentialLR":
734
+ scheduler = t.optim.lr_scheduler.ExponentialLR(opti, gamma=cfg["lr_sched_exp_fac"])
735
+ interval = "step"
736
+ else:
737
+ scheduler = None
738
+ if scheduler is None:
739
+ return [opti]
740
+ else:
741
+ opti_dict = {
742
+ "optimizer": opti,
743
+ "lr_scheduler": {
744
+ "scheduler": scheduler,
745
+ "monitor": "global_step",
746
+ "frequency": 1,
747
+ "interval": interval,
748
+ },
749
+ }
750
+ return opti_dict
751
+
752
+ def on_fit_start(self) -> None:
753
+ if self.profile_torch_run:
754
+ self.profilerr.start()
755
+ return super().on_fit_start()
756
+
757
+ def on_fit_end(self) -> None:
758
+ if self.profile_torch_run:
759
+ self.profilerr.stop()
760
+ return super().on_fit_end()
761
+
762
+
763
+ def prep_model_input(self, batch):
764
+ if len(batch) == 1:
765
+ batch = batch[0]
766
+ if self.use_char_embed_info:
767
+ if len(batch) == 5:
768
+ x, chars_coords, ims, attention_mask, _ = batch
769
+ elif batch[1].ndim == 4:
770
+ x, ims, attention_mask, _ = batch
771
+ else:
772
+ x, chars_coords, attention_mask, _ = batch
773
+ padding_list = None
774
+ else:
775
+ if len(batch) > 3:
776
+ x = batch[0]
777
+ y = batch[-1]
778
+ attention_mask = batch[1]
779
+ else:
780
+ x, attention_mask, y = batch
781
+
782
+ if self.model_to_use != "cv_only_model" and not self.hparams.cfg["only_use_2nd_input_stream"]:
783
+ x_embedded = self.project(x)
784
+ else:
785
+ x_embedded = x
786
+ if self.use_char_embed_info:
787
+ if self.method_chars_into_model == "dense":
788
+ bool_mask = chars_coords == self.input_padding_val
789
+ bool_mask = bool_mask[:, :, 0]
790
+ chars_coords_projected = self.chars_project_0(chars_coords).squeeze(-1)
791
+ chars_coords_projected = chars_coords_projected * bool_mask
792
+ if self.chars_project_1.in_features == chars_coords_projected.shape[-1]:
793
+ chars_coords_projected = self.chars_project_1(chars_coords_projected)
794
+ else:
795
+ chars_coords_projected = chars_coords_projected.mean(dim=-1)
796
+ chars_coords_projected = chars_coords_projected.unsqueeze(1).repeat(1, x_embedded.shape[2])
797
+ elif self.method_chars_into_model == "bert":
798
+ chars_mask = chars_coords != self.input_padding_val
799
+ chars_mask = t.cat(
800
+ (
801
+ t.ones(chars_mask[:, :1, 0].shape, dtype=t.long, device=chars_coords.device),
802
+ chars_mask[:, :, 0].to(t.long),
803
+ ),
804
+ dim=1,
805
+ )
806
+ chars_coords_projected = self.chars_project(chars_coords)
807
+
808
+ position_ids = t.arange(
809
+ 0, chars_coords_projected.shape[1] + 1, dtype=t.long, device=chars_coords_projected.device
810
+ )
811
+ token_type_ids = t.zeros(
812
+ (chars_coords_projected.size()[0], chars_coords_projected.size()[1] + 1),
813
+ dtype=t.long,
814
+ device=chars_coords_projected.device,
815
+ ) # +1 for CLS
816
+ chars_coords_projected = t.cat(
817
+ (t.ones_like(chars_coords_projected[:, :1, :]), chars_coords_projected), dim=1
818
+ ) # to add CLS token
819
+ chars_coords_projected = self.chars_bert(
820
+ position_ids=position_ids,
821
+ inputs_embeds=chars_coords_projected,
822
+ token_type_ids=token_type_ids,
823
+ attention_mask=chars_mask,
824
+ )
825
+ if hasattr(chars_coords_projected, "last_hidden_state"):
826
+ chars_coords_projected = chars_coords_projected.last_hidden_state[:, 0, :]
827
+ elif hasattr(chars_coords_projected, "logits"):
828
+ chars_coords_projected = chars_coords_projected.logits
829
+ else:
830
+ chars_coords_projected = chars_coords_projected.hidden_states[-1][:, 0, :]
831
+ elif self.method_chars_into_model == "resnet":
832
+ chars_conv_out = self.chars_conv(ims)
833
+ if isinstance(chars_conv_out, list):
834
+ chars_conv_out = chars_conv_out[0]
835
+ if hasattr(chars_conv_out, "logits"):
836
+ chars_conv_out = chars_conv_out.logits
837
+ chars_coords_projected = self.chars_classifier(chars_conv_out)
838
+
839
+ chars_coords_projected = chars_coords_projected.unsqueeze(1).repeat(1, x_embedded.shape[1], 1)
840
+ if hasattr(self, "chars_project_class_output"):
841
+ chars_coords_projected = self.chars_project_class_output(chars_coords_projected)
842
+
843
+ if self.hparams.cfg["only_use_2nd_input_stream"]:
844
+ x_embedded = chars_coords_projected
845
+ elif self.method_to_include_char_positions == "concat":
846
+ x_embedded = t.cat((x_embedded, chars_coords_projected), dim=-1)
847
+ else:
848
+ x_embedded = x_embedded + chars_coords_projected
849
+ return x_embedded, attention_mask
850
+
851
+
852
+ def forward(self, batch):
853
+ prepped_input = prep_model_input(self, batch)
854
+
855
+ if len(batch) > 5:
856
+ x_embedded, padding_list, attention_mask, attention_mask_for_prediction = prepped_input
857
+ elif len(batch) > 2:
858
+ x_embedded, attention_mask = prepped_input
859
+ else:
860
+ x_embedded = prepped_input[0]
861
+ attention_mask = prepped_input[-1]
862
+
863
+ position_ids = t.arange(0, x_embedded.shape[1], dtype=t.long, device=x_embedded.device)
864
+ token_type_ids = t.zeros(x_embedded.size()[:-1], dtype=t.long, device=x_embedded.device)
865
+
866
+ if self.layer_norm_after_in_projection:
867
+ x_embedded = self.layer_norm_in(x_embedded)
868
+
869
+ if self.model_to_use == "LSTM":
870
+ bert_out = self.bert_model(x_embedded)
871
+ elif self.model_to_use in ["ProphetNet", "T5", "FunnelModel"]:
872
+ bert_out = self.bert_model(inputs_embeds=x_embedded, attention_mask=attention_mask)
873
+ elif self.model_to_use == "xBERT":
874
+ bert_out = self.bert_model(x_embedded, mask=attention_mask.to(bool))
875
+ elif self.model_to_use == "cv_only_model":
876
+ bert_out = self.bert_model(x_embedded)
877
+ else:
878
+ bert_out = self.bert_model(
879
+ position_ids=position_ids,
880
+ inputs_embeds=x_embedded,
881
+ token_type_ids=token_type_ids,
882
+ attention_mask=attention_mask,
883
+ )
884
+ if hasattr(bert_out, "last_hidden_state"):
885
+ last_hidden_state = bert_out.last_hidden_state
886
+ out = self.linear(last_hidden_state)
887
+ elif hasattr(bert_out, "logits"):
888
+ out = bert_out.logits
889
+ else:
890
+ out = bert_out
891
+ out = self.final_activation(out)
892
+ return out
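For orientation (not part of the diff): with loss_function: corn_loss and last_activation: Identity, the forward pass above returns raw CORN logits of shape (batch, seq_len, out_shape). Below is a minimal sketch of how such logits are decoded into per-fixation line labels, mirroring _get_preds_reals above; the dummy shape (2, 500, 16) simply assumes the max_seq_length and num_classes used in the shipped configs.

import torch as t
import einops as eo
from loss_functions import corn_label_from_logits

logits = t.randn(2, 500, 16)                              # dummy (batch, seq_len, out_shape) model output
seq_len = logits.shape[1]
flat = eo.rearrange(logits, "b s c -> (b s) c")           # fold the sequence dim, as _fold_in_seq_dim does
labels = corn_label_from_logits(flat)                     # ordinal decoding of the CORN logits
labels = eo.rearrange(labels, "(b s) -> b s", s=seq_len)  # back to (batch, seq_len) line assignments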
models/BERT_20240104-223349_loop_normalize_by_line_height_and_width_True_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00430.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c4ae65e81c722f3732563942ab40447a186869bebb1bbc8433a782805e73ac3
3
+ size 86691676
models/BERT_20240104-233803_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00719.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a7588696e4afc4c8ffb0ff361d9566b7b360c61a3bb6fd6fcb484942b6d2568b
3
+ size 86692053
models/BERT_20240107-152040_loop_restrict_sim_data_to_4000_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00515.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:815b5500a1ae0a04bb55ae58c3896f07981757a2e1a2adf2cbc8a346551d88df
3
+ size 86686270
models/BERT_20240108-000344_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00706.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2e56e1e33da611622315995e0cdf4db5aad6a086420401ca3ee95393b8977ac
3
+ size 86692053
models/BERT_20240108-011230_loop_normalize_by_line_height_and_width_True_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00560.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f060242cf0bc494d2908e0e99e9d411c9a9b131443cff91bb245229dad2f783
3
+ size 86691676
models/BERT_20240109-090419_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00518.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bbf23ac7baa88a957e1782158bd7a32aedcfcb0527b203079191ac259ec146c5
3
+ size 86692053
models/BERT_20240122-183729_loop_normalize_by_line_height_and_width_True_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00523.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3fb7c8238752af51b64a23291080bb30edf9e090defcb2ec4015ddc8d543a9de
3
+ size 86691740
models/BERT_20240122-194041_loop_normalize_by_line_height_and_width_False_dataset_folder_idx_evaluation_8_epoch=41-val_loss=0.00462.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:54fedcc5bdeda01bfae26bafcb7542c766807f1af9da7731aaa7ed38e93743d8
3
+ size 86692117
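Note (not part of the repository): the eight .ckpt entries above are Git LFS pointer files, each consisting of the three key/value lines shown, not the serialized weights themselves; the binaries are fetched separately, for example with git lfs pull. A minimal, hypothetical parser for such a pointer file (the helper name read_lfs_pointer is an assumption for illustration):

from pathlib import Path

def read_lfs_pointer(path: str) -> dict:
    """Parse the three 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in Path(path).read_text().splitlines() if line.strip())
    fields["size"] = int(fields["size"])
    return fields

# For the first checkpoint above this would return roughly:
# {'version': 'https://git-lfs.github.com/spec/v1', 'oid': 'sha256:9c4ae65e...', 'size': 86691676}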
models/BERT_fin_exp_20240104-223349.yaml ADDED
@@ -0,0 +1,100 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
6
+ char_dims: 0
7
+ char_plot_shape:
8
+ - 224
9
+ - 224
10
+ chars_bert_reduction_factor: 4
11
+ chars_conv_lr_reduction_factor: 1
12
+ chars_conv_pooling_out_dim: 1
13
+ convert_posix: false
14
+ convert_winpath: false
15
+ cv_char_modelname: coatnet_nano_rw_224
16
+ cv_modelname: null
17
+ early_stopping_patience: 15
18
+ gamma_multistep: null
19
+ gamma_step_factor: 0.5
20
+ gamma_step_size: 3000
21
+ head_multiplication_factor: 64
22
+ hidden_dim_bert: 512
23
+ hidden_dropout_prob: 0.0
24
+ im_partial_string: fixations_chars_channel_sep
25
+ input_padding_val: 10
26
+ last_activation: Identity
27
+ layer_norm_after_in_projection: true
28
+ linear_activation: GELU
29
+ load_best_checkpoint_at_end: false
30
+ loss_function: corn_loss
31
+ lr: 0.0004
32
+ lr_initial: '0.0004'
33
+ lr_sched_exp_fac: null
34
+ lr_scheduling: StepLR
35
+ manual_max_sequence_for_model: 500
36
+ max_len_chars_list: 0
37
+ max_seq_length: 500
38
+ method_chars_into_model: resnet
39
+ method_to_include_char_positions: concat
40
+ min_lr_anneal: 1e-6
41
+ model_to_use: BERT
42
+ multistep_milestones: null
43
+ n_layers_BERT: 4
44
+ norm_by_char_averages: false
45
+ norm_by_line_width: false
46
+ norm_coords_by_letter_min_x_y: false
47
+ normalize_by_line_height_and_width: true
48
+ num_attention_heads: 8
49
+ num_classes: 16
50
+ num_lin_layers: 1
51
+ num_warmup_steps: 3000
52
+ one_hot_y: false
53
+ ord_reg_loss_max: 16
54
+ ord_reg_loss_min: -1
55
+ padding_at_end: true
56
+ plot_histogram: true
57
+ plot_learning_curves: true
58
+ precision: 16-mixed
59
+ prediction_only: false
60
+ pretrained_model_name_to_load: null
61
+ profile_torch_run: false
62
+ reload_model: false
63
+ reload_model_date: null
64
+ remove_eval_idx_from_train_idx: true
65
+ remove_timm_classifier_head_pooling: true
66
+ sample_cols:
67
+ - x
68
+ - y
69
+ sample_means:
70
+ - 0.7326
71
+ - 6.6381
72
+ - 2.4717
73
+ sample_std:
74
+ - 0.2778
75
+ - 1.882
76
+ - 1.8562
77
+ sample_std_unscaled:
78
+ - 285.193
79
+ - 131.1842
80
+ - 1.8562
81
+ save_weights_only: true
82
+ set_max_seq_len_manually: true
83
+ set_num_classes_manually: true
84
+ source_for_pretrained_cv_model: timm
85
+ target_padding_number: -100
86
+ track_activations_via_hook: false
87
+ track_gradient_histogram: false
88
+ use_char_bounding_boxes: true
89
+ use_early_stopping: false
90
+ use_embedded_char_pos_info: true
91
+ use_fixation_duration_information: false
92
+ use_in_projection_bias: false
93
+ use_lr_warmup: true
94
+ use_pupil_size_information: false
95
+ use_reduce_on_plateau: false
96
+ use_start_time_as_input_col: false
97
+ use_training_steps_for_end_and_lr_decay: true
98
+ use_words_coords: false
99
+ warmup_exponent: 1
100
+ weight_decay: 0.0
models/BERT_fin_exp_20240104-233803.yaml ADDED
@@ -0,0 +1,100 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
6
+ char_dims: 0
7
+ char_plot_shape:
8
+ - 224
9
+ - 224
10
+ chars_bert_reduction_factor: 4
11
+ chars_conv_lr_reduction_factor: 1
12
+ chars_conv_pooling_out_dim: 1
13
+ convert_posix: false
14
+ convert_winpath: false
15
+ cv_char_modelname: coatnet_nano_rw_224
16
+ cv_modelname: null
17
+ early_stopping_patience: 15
18
+ gamma_multistep: null
19
+ gamma_step_factor: 0.5
20
+ gamma_step_size: 3000
21
+ head_multiplication_factor: 64
22
+ hidden_dim_bert: 512
23
+ hidden_dropout_prob: 0.0
24
+ im_partial_string: fixations_chars_channel_sep
25
+ input_padding_val: 10
26
+ last_activation: Identity
27
+ layer_norm_after_in_projection: true
28
+ linear_activation: GELU
29
+ load_best_checkpoint_at_end: false
30
+ loss_function: corn_loss
31
+ lr: 0.0004
32
+ lr_initial: '0.0004'
33
+ lr_sched_exp_fac: null
34
+ lr_scheduling: StepLR
35
+ manual_max_sequence_for_model: 500
36
+ max_len_chars_list: 0
37
+ max_seq_length: 500
38
+ method_chars_into_model: resnet
39
+ method_to_include_char_positions: concat
40
+ min_lr_anneal: 1e-6
41
+ model_to_use: BERT
42
+ multistep_milestones: null
43
+ n_layers_BERT: 4
44
+ norm_by_char_averages: false
45
+ norm_by_line_width: false
46
+ norm_coords_by_letter_min_x_y: false
47
+ normalize_by_line_height_and_width: false
48
+ num_attention_heads: 8
49
+ num_classes: 16
50
+ num_lin_layers: 1
51
+ num_warmup_steps: 3000
52
+ one_hot_y: false
53
+ ord_reg_loss_max: 16
54
+ ord_reg_loss_min: -1
55
+ padding_at_end: true
56
+ plot_histogram: true
57
+ plot_learning_curves: true
58
+ precision: 16-mixed
59
+ prediction_only: false
60
+ pretrained_model_name_to_load: null
61
+ profile_torch_run: false
62
+ reload_model: false
63
+ reload_model_date: null
64
+ remove_eval_idx_from_train_idx: true
65
+ remove_timm_classifier_head_pooling: true
66
+ sample_cols:
67
+ - x
68
+ - y
69
+ sample_means:
70
+ - 710.6114
71
+ - 473.7518
72
+ - 2.4717
73
+ sample_std:
74
+ - 285.1937
75
+ - 131.1842
76
+ - 1.8562
77
+ sample_std_unscaled:
78
+ - 285.193
79
+ - 131.1842
80
+ - 1.8562
81
+ save_weights_only: true
82
+ set_max_seq_len_manually: true
83
+ set_num_classes_manually: true
84
+ source_for_pretrained_cv_model: timm
85
+ target_padding_number: -100
86
+ track_activations_via_hook: false
87
+ track_gradient_histogram: false
88
+ use_char_bounding_boxes: true
89
+ use_early_stopping: false
90
+ use_embedded_char_pos_info: true
91
+ use_fixation_duration_information: false
92
+ use_in_projection_bias: false
93
+ use_lr_warmup: true
94
+ use_pupil_size_information: false
95
+ use_reduce_on_plateau: false
96
+ use_start_time_as_input_col: false
97
+ use_training_steps_for_end_and_lr_decay: true
98
+ use_words_coords: false
99
+ warmup_exponent: 1
100
+ weight_decay: 0.0
models/BERT_fin_exp_20240107-152040.yaml ADDED
@@ -0,0 +1,100 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
6
+ char_dims: 0
7
+ char_plot_shape:
8
+ - 224
9
+ - 224
10
+ chars_bert_reduction_factor: 4
11
+ chars_conv_lr_reduction_factor: 1
12
+ chars_conv_pooling_out_dim: 1
13
+ convert_posix: false
14
+ convert_winpath: false
15
+ cv_char_modelname: coatnet_nano_rw_224
16
+ cv_modelname: null
17
+ early_stopping_patience: 15
18
+ gamma_multistep: null
19
+ gamma_step_factor: 0.5
20
+ gamma_step_size: 3000
21
+ head_multiplication_factor: 64
22
+ hidden_dim_bert: 512
23
+ hidden_dropout_prob: 0.0
24
+ im_partial_string: fixations_chars_channel_sep
25
+ input_padding_val: 10
26
+ last_activation: Identity
27
+ layer_norm_after_in_projection: true
28
+ linear_activation: GELU
29
+ load_best_checkpoint_at_end: false
30
+ loss_function: corn_loss
31
+ lr: 0.0004
32
+ lr_initial: '0.0004'
33
+ lr_sched_exp_fac: null
34
+ lr_scheduling: StepLR
35
+ manual_max_sequence_for_model: 500
36
+ max_len_chars_list: 0
37
+ max_seq_length: 500
38
+ method_chars_into_model: resnet
39
+ method_to_include_char_positions: concat
40
+ min_lr_anneal: 1e-6
41
+ model_to_use: BERT
42
+ multistep_milestones: null
43
+ n_layers_BERT: 4
44
+ norm_by_char_averages: false
45
+ norm_by_line_width: false
46
+ norm_coords_by_letter_min_x_y: true
47
+ normalize_by_line_height_and_width: true
48
+ num_attention_heads: 8
49
+ num_classes: 16
50
+ num_lin_layers: 1
51
+ num_warmup_steps: 3000
52
+ one_hot_y: false
53
+ ord_reg_loss_max: 16
54
+ ord_reg_loss_min: -1
55
+ padding_at_end: true
56
+ plot_histogram: true
57
+ plot_learning_curves: true
58
+ precision: 16-mixed
59
+ prediction_only: false
60
+ pretrained_model_name_to_load: null
61
+ profile_torch_run: false
62
+ reload_model: false
63
+ reload_model_date: null
64
+ remove_eval_idx_from_train_idx: true
65
+ remove_timm_classifier_head_pooling: true
66
+ sample_cols:
67
+ - x
68
+ - y
69
+ sample_means:
70
+ - 0.4423
71
+ - 3.1164
72
+ - 2.4717
73
+ sample_std:
74
+ - 0.2778
75
+ - 1.882
76
+ - 1.8562
77
+ sample_std_unscaled:
78
+ - 285.193
79
+ - 131.1842
80
+ - 1.8562
81
+ save_weights_only: true
82
+ set_max_seq_len_manually: true
83
+ set_num_classes_manually: true
84
+ source_for_pretrained_cv_model: timm
85
+ target_padding_number: -100
86
+ track_activations_via_hook: false
87
+ track_gradient_histogram: false
88
+ use_char_bounding_boxes: true
89
+ use_early_stopping: false
90
+ use_embedded_char_pos_info: true
91
+ use_fixation_duration_information: false
92
+ use_in_projection_bias: false
93
+ use_lr_warmup: true
94
+ use_pupil_size_information: false
95
+ use_reduce_on_plateau: false
96
+ use_start_time_as_input_col: false
97
+ use_training_steps_for_end_and_lr_decay: true
98
+ use_words_coords: false
99
+ warmup_exponent: 1
100
+ weight_decay: 0.0
models/BERT_fin_exp_20240108-000344.yaml ADDED
@@ -0,0 +1,100 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
6
+ char_dims: 0
7
+ char_plot_shape:
8
+ - 224
9
+ - 224
10
+ chars_bert_reduction_factor: 4
11
+ chars_conv_lr_reduction_factor: 1
12
+ chars_conv_pooling_out_dim: 1
13
+ convert_posix: false
14
+ convert_winpath: true
15
+ cv_char_modelname: coatnet_nano_rw_224
16
+ cv_modelname: null
17
+ early_stopping_patience: 15
18
+ gamma_multistep: null
19
+ gamma_step_factor: 0.5
20
+ gamma_step_size: 3000
21
+ head_multiplication_factor: 64
22
+ hidden_dim_bert: 512
23
+ hidden_dropout_prob: 0.0
24
+ im_partial_string: fixations_chars_channel_sep
25
+ input_padding_val: 10
26
+ last_activation: Identity
27
+ layer_norm_after_in_projection: true
28
+ linear_activation: GELU
29
+ load_best_checkpoint_at_end: false
30
+ loss_function: corn_loss
31
+ lr: 0.0004
32
+ lr_initial: '0.0004'
33
+ lr_sched_exp_fac: null
34
+ lr_scheduling: StepLR
35
+ manual_max_sequence_for_model: 500
36
+ max_len_chars_list: 0
37
+ max_seq_length: 500
38
+ method_chars_into_model: resnet
39
+ method_to_include_char_positions: concat
40
+ min_lr_anneal: 1e-6
41
+ model_to_use: BERT
42
+ multistep_milestones: null
43
+ n_layers_BERT: 4
44
+ norm_by_char_averages: false
45
+ norm_by_line_width: false
46
+ norm_coords_by_letter_min_x_y: true
47
+ normalize_by_line_height_and_width: false
48
+ num_attention_heads: 8
49
+ num_classes: 16
50
+ num_lin_layers: 1
51
+ num_warmup_steps: 3000
52
+ one_hot_y: false
53
+ ord_reg_loss_max: 16
54
+ ord_reg_loss_min: -1
55
+ padding_at_end: true
56
+ plot_histogram: true
57
+ plot_learning_curves: true
58
+ precision: 16-mixed
59
+ prediction_only: false
60
+ pretrained_model_name_to_load: null
61
+ profile_torch_run: false
62
+ reload_model: false
63
+ reload_model_date: null
64
+ remove_eval_idx_from_train_idx: true
65
+ remove_timm_classifier_head_pooling: true
66
+ sample_cols:
67
+ - x
68
+ - y
69
+ sample_means:
70
+ - 455.5905
71
+ - 218.0598
72
+ - 2.4717
73
+ sample_std:
74
+ - 285.1936
75
+ - 131.1842
76
+ - 1.8562
77
+ sample_std_unscaled:
78
+ - 285.1939
79
+ - 131.1844
80
+ - 1.8562
81
+ save_weights_only: true
82
+ set_max_seq_len_manually: true
83
+ set_num_classes_manually: true
84
+ source_for_pretrained_cv_model: timm
85
+ target_padding_number: -100
86
+ track_activations_via_hook: false
87
+ track_gradient_histogram: false
88
+ use_char_bounding_boxes: true
89
+ use_early_stopping: false
90
+ use_embedded_char_pos_info: true
91
+ use_fixation_duration_information: false
92
+ use_in_projection_bias: false
93
+ use_lr_warmup: true
94
+ use_pupil_size_information: false
95
+ use_reduce_on_plateau: false
96
+ use_start_time_as_input_col: false
97
+ use_training_steps_for_end_and_lr_decay: true
98
+ use_words_coords: false
99
+ warmup_exponent: 1
100
+ weight_decay: 0.0
models/BERT_fin_exp_20240108-011230.yaml ADDED
@@ -0,0 +1,100 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
6
+ char_dims: 0
7
+ char_plot_shape:
8
+ - 224
9
+ - 224
10
+ chars_bert_reduction_factor: 4
11
+ chars_conv_lr_reduction_factor: 1
12
+ chars_conv_pooling_out_dim: 1
13
+ convert_posix: false
14
+ convert_winpath: true
15
+ cv_char_modelname: coatnet_nano_rw_224
16
+ cv_modelname: null
17
+ early_stopping_patience: 15
18
+ gamma_multistep: null
19
+ gamma_step_factor: 0.5
20
+ gamma_step_size: 3000
21
+ head_multiplication_factor: 64
22
+ hidden_dim_bert: 512
23
+ hidden_dropout_prob: 0.0
24
+ im_partial_string: fixations_chars_channel_sep
25
+ input_padding_val: 10
26
+ last_activation: Identity
27
+ layer_norm_after_in_projection: true
28
+ linear_activation: GELU
29
+ load_best_checkpoint_at_end: false
30
+ loss_function: corn_loss
31
+ lr: 0.0004
32
+ lr_initial: '0.0004'
33
+ lr_sched_exp_fac: null
34
+ lr_scheduling: StepLR
35
+ manual_max_sequence_for_model: 500
36
+ max_len_chars_list: 0
37
+ max_seq_length: 500
38
+ method_chars_into_model: resnet
39
+ method_to_include_char_positions: concat
40
+ min_lr_anneal: 1e-6
41
+ model_to_use: BERT
42
+ multistep_milestones: null
43
+ n_layers_BERT: 4
44
+ norm_by_char_averages: false
45
+ norm_by_line_width: false
46
+ norm_coords_by_letter_min_x_y: true
47
+ normalize_by_line_height_and_width: true
48
+ num_attention_heads: 8
49
+ num_classes: 16
50
+ num_lin_layers: 1
51
+ num_warmup_steps: 3000
52
+ one_hot_y: false
53
+ ord_reg_loss_max: 16
54
+ ord_reg_loss_min: -1
55
+ padding_at_end: true
56
+ plot_histogram: true
57
+ plot_learning_curves: true
58
+ precision: 16-mixed
59
+ prediction_only: false
60
+ pretrained_model_name_to_load: null
61
+ profile_torch_run: false
62
+ reload_model: false
63
+ reload_model_date: null
64
+ remove_eval_idx_from_train_idx: true
65
+ remove_timm_classifier_head_pooling: true
66
+ sample_cols:
67
+ - x
68
+ - y
69
+ sample_means:
70
+ - 0.4423
71
+ - 3.1164
72
+ - 2.4717
73
+ sample_std:
74
+ - 0.2778
75
+ - 1.882
76
+ - 1.8562
77
+ sample_std_unscaled:
78
+ - 285.1939
79
+ - 131.1844
80
+ - 1.8562
81
+ save_weights_only: true
82
+ set_max_seq_len_manually: true
83
+ set_num_classes_manually: true
84
+ source_for_pretrained_cv_model: timm
85
+ target_padding_number: -100
86
+ track_activations_via_hook: false
87
+ track_gradient_histogram: false
88
+ use_char_bounding_boxes: true
89
+ use_early_stopping: false
90
+ use_embedded_char_pos_info: true
91
+ use_fixation_duration_information: false
92
+ use_in_projection_bias: false
93
+ use_lr_warmup: true
94
+ use_pupil_size_information: false
95
+ use_reduce_on_plateau: false
96
+ use_start_time_as_input_col: false
97
+ use_training_steps_for_end_and_lr_decay: true
98
+ use_words_coords: false
99
+ warmup_exponent: 1
100
+ weight_decay: 0.0
models/BERT_fin_exp_20240109-090419.yaml ADDED
@@ -0,0 +1,100 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
6
+ char_dims: 0
7
+ char_plot_shape:
8
+ - 224
9
+ - 224
10
+ chars_bert_reduction_factor: 4
11
+ chars_conv_lr_reduction_factor: 1
12
+ chars_conv_pooling_out_dim: 1
13
+ convert_posix: false
14
+ convert_winpath: true
15
+ cv_char_modelname: coatnet_nano_rw_224
16
+ cv_modelname: null
17
+ early_stopping_patience: 15
18
+ gamma_multistep: null
19
+ gamma_step_factor: 0.5
20
+ gamma_step_size: 3000
21
+ head_multiplication_factor: 64
22
+ hidden_dim_bert: 512
23
+ hidden_dropout_prob: 0.0
24
+ im_partial_string: fixations_chars_channel_sep
25
+ input_padding_val: 10
26
+ last_activation: Identity
27
+ layer_norm_after_in_projection: true
28
+ linear_activation: GELU
29
+ load_best_checkpoint_at_end: false
30
+ loss_function: corn_loss
31
+ lr: 0.0004
32
+ lr_initial: '0.0004'
33
+ lr_sched_exp_fac: null
34
+ lr_scheduling: StepLR
35
+ manual_max_sequence_for_model: 500
36
+ max_len_chars_list: 0
37
+ max_seq_length: 500
38
+ method_chars_into_model: resnet
39
+ method_to_include_char_positions: concat
40
+ min_lr_anneal: 1e-6
41
+ model_to_use: BERT
42
+ multistep_milestones: null
43
+ n_layers_BERT: 4
44
+ norm_by_char_averages: false
45
+ norm_by_line_width: false
46
+ norm_coords_by_letter_min_x_y: true
47
+ normalize_by_line_height_and_width: false
48
+ num_attention_heads: 8
49
+ num_classes: 16
50
+ num_lin_layers: 1
51
+ num_warmup_steps: 3000
52
+ one_hot_y: false
53
+ ord_reg_loss_max: 16
54
+ ord_reg_loss_min: -1
55
+ padding_at_end: true
56
+ plot_histogram: true
57
+ plot_learning_curves: true
58
+ precision: 16-mixed
59
+ prediction_only: false
60
+ pretrained_model_name_to_load: null
61
+ profile_torch_run: false
62
+ reload_model: false
63
+ reload_model_date: null
64
+ remove_eval_idx_from_train_idx: true
65
+ remove_timm_classifier_head_pooling: true
66
+ sample_cols:
67
+ - x
68
+ - y
69
+ sample_means:
70
+ - 455.708
71
+ - 217.8342
72
+ - 2.4706
73
+ sample_std:
74
+ - 285.2534
75
+ - 131.0263
76
+ - 1.8542
77
+ sample_std_unscaled:
78
+ - 285.2527
79
+ - 131.0262
80
+ - 1.8543
81
+ save_weights_only: true
82
+ set_max_seq_len_manually: true
83
+ set_num_classes_manually: true
84
+ source_for_pretrained_cv_model: timm
85
+ target_padding_number: -100
86
+ track_activations_via_hook: false
87
+ track_gradient_histogram: false
88
+ use_char_bounding_boxes: true
89
+ use_early_stopping: false
90
+ use_embedded_char_pos_info: true
91
+ use_fixation_duration_information: false
92
+ use_in_projection_bias: false
93
+ use_lr_warmup: true
94
+ use_pupil_size_information: false
95
+ use_reduce_on_plateau: false
96
+ use_start_time_as_input_col: false
97
+ use_training_steps_for_end_and_lr_decay: true
98
+ use_words_coords: false
99
+ warmup_exponent: 1
100
+ weight_decay: 0.0
models/BERT_fin_exp_20240122-183729.yaml ADDED
@@ -0,0 +1,102 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ add_woc_feature: false
6
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
7
+ char_dims: 0
8
+ char_plot_shape:
9
+ - 224
10
+ - 224
11
+ chars_bert_reduction_factor: 4
12
+ chars_conv_lr_reduction_factor: 1
13
+ chars_conv_pooling_out_dim: 1
14
+ convert_posix: false
15
+ convert_winpath: false
16
+ cv_char_modelname: coatnet_nano_rw_224
17
+ cv_modelname: null
18
+ early_stopping_patience: 15
19
+ gamma_multistep: null
20
+ gamma_step_factor: 0.5
21
+ gamma_step_size: 3000
22
+ head_multiplication_factor: 64
23
+ hidden_dim_bert: 512
24
+ hidden_dropout_prob: 0.0
25
+ im_partial_string: fixations_chars_channel_sep
26
+ input_padding_val: 10
27
+ last_activation: Identity
28
+ layer_norm_after_in_projection: true
29
+ linear_activation: GELU
30
+ load_best_checkpoint_at_end: false
31
+ loss_function: corn_loss
32
+ lr: 0.0004
33
+ lr_initial: '0.0004'
34
+ lr_sched_exp_fac: null
35
+ lr_scheduling: StepLR
36
+ manual_max_sequence_for_model: 500
37
+ max_len_chars_list: 0
38
+ max_seq_length: 500
39
+ method_chars_into_model: resnet
40
+ method_to_include_char_positions: concat
41
+ min_lr_anneal: 1e-6
42
+ model_to_use: BERT
43
+ multistep_milestones: null
44
+ n_layers_BERT: 4
45
+ norm_by_char_averages: false
46
+ norm_by_line_width: false
47
+ norm_coords_by_letter_min_x_y: true
48
+ normalize_by_line_height_and_width: true
49
+ num_attention_heads: 8
50
+ num_classes: 16
51
+ num_lin_layers: 1
52
+ num_warmup_steps: 3000
53
+ one_hot_y: false
54
+ only_use_2nd_input_stream: false
55
+ ord_reg_loss_max: 16
56
+ ord_reg_loss_min: -1
57
+ padding_at_end: true
58
+ plot_histogram: true
59
+ plot_learning_curves: true
60
+ precision: 16-mixed
61
+ prediction_only: false
62
+ pretrained_model_name_to_load: null
63
+ profile_torch_run: false
64
+ reload_model: false
65
+ reload_model_date: null
66
+ remove_eval_idx_from_train_idx: true
67
+ remove_timm_classifier_head_pooling: true
68
+ sample_cols:
69
+ - x
70
+ - y
71
+ sample_means:
72
+ - 0.4433
73
+ - 2.9599
74
+ - 2.3264
75
+ sample_std:
76
+ - 0.2782
77
+ - 1.7872
78
+ - 1.7619
79
+ sample_std_unscaled:
80
+ - 287.0107
81
+ - 124.4113
82
+ - 1.7619
83
+ save_weights_only: true
84
+ set_max_seq_len_manually: true
85
+ set_num_classes_manually: true
86
+ source_for_pretrained_cv_model: timm
87
+ target_padding_number: -100
88
+ track_activations_via_hook: false
89
+ track_gradient_histogram: false
90
+ use_char_bounding_boxes: true
91
+ use_early_stopping: false
92
+ use_embedded_char_pos_info: true
93
+ use_fixation_duration_information: false
94
+ use_in_projection_bias: false
95
+ use_lr_warmup: true
96
+ use_pupil_size_information: false
97
+ use_reduce_on_plateau: false
98
+ use_start_time_as_input_col: false
99
+ use_training_steps_for_end_and_lr_decay: true
100
+ use_words_coords: false
101
+ warmup_exponent: 1
102
+ weight_decay: 0.0
models/BERT_fin_exp_20240122-194041.yaml ADDED
@@ -0,0 +1,102 @@
1
+ add_layer_norm_to_char_mlp: true
2
+ add_layer_norm_to_in_projection: false
3
+ add_line_overlap_feature: true
4
+ add_normalised_values_as_features: false
5
+ add_woc_feature: false
6
+ change_pooling_for_timm_head_to: AdaptiveAvgPool2d
7
+ char_dims: 0
8
+ char_plot_shape:
9
+ - 224
10
+ - 224
11
+ chars_bert_reduction_factor: 4
12
+ chars_conv_lr_reduction_factor: 1
13
+ chars_conv_pooling_out_dim: 1
14
+ convert_posix: false
15
+ convert_winpath: false
16
+ cv_char_modelname: coatnet_nano_rw_224
17
+ cv_modelname: null
18
+ early_stopping_patience: 15
19
+ gamma_multistep: null
20
+ gamma_step_factor: 0.5
21
+ gamma_step_size: 3000
22
+ head_multiplication_factor: 64
23
+ hidden_dim_bert: 512
24
+ hidden_dropout_prob: 0.0
25
+ im_partial_string: fixations_chars_channel_sep
26
+ input_padding_val: 10
27
+ last_activation: Identity
28
+ layer_norm_after_in_projection: true
29
+ linear_activation: GELU
30
+ load_best_checkpoint_at_end: false
31
+ loss_function: corn_loss
32
+ lr: 0.0004
33
+ lr_initial: '0.0004'
34
+ lr_sched_exp_fac: null
35
+ lr_scheduling: StepLR
36
+ manual_max_sequence_for_model: 500
37
+ max_len_chars_list: 0
38
+ max_seq_length: 500
39
+ method_chars_into_model: resnet
40
+ method_to_include_char_positions: concat
41
+ min_lr_anneal: 1e-6
42
+ model_to_use: BERT
43
+ multistep_milestones: null
44
+ n_layers_BERT: 4
45
+ norm_by_char_averages: false
46
+ norm_by_line_width: false
47
+ norm_coords_by_letter_min_x_y: true
48
+ normalize_by_line_height_and_width: false
49
+ num_attention_heads: 8
50
+ num_classes: 16
51
+ num_lin_layers: 1
52
+ num_warmup_steps: 3000
53
+ one_hot_y: false
54
+ only_use_2nd_input_stream: false
55
+ ord_reg_loss_max: 16
56
+ ord_reg_loss_min: -1
57
+ padding_at_end: true
58
+ plot_histogram: true
59
+ plot_learning_curves: true
60
+ precision: 16-mixed
61
+ prediction_only: false
62
+ pretrained_model_name_to_load: null
63
+ profile_torch_run: false
64
+ reload_model: false
65
+ reload_model_date: null
66
+ remove_eval_idx_from_train_idx: true
67
+ remove_timm_classifier_head_pooling: true
68
+ sample_cols:
69
+ - x
70
+ - y
71
+ sample_means:
72
+ - 459.3367
73
+ - 206.88
74
+ - 2.3264
75
+ sample_std:
76
+ - 287.0111
77
+ - 124.4113
78
+ - 1.7619
79
+ sample_std_unscaled:
80
+ - 287.0107
81
+ - 124.4113
82
+ - 1.7619
83
+ save_weights_only: true
84
+ set_max_seq_len_manually: true
85
+ set_num_classes_manually: true
86
+ source_for_pretrained_cv_model: timm
87
+ target_padding_number: -100
88
+ track_activations_via_hook: false
89
+ track_gradient_histogram: false
90
+ use_char_bounding_boxes: true
91
+ use_early_stopping: false
92
+ use_embedded_char_pos_info: true
93
+ use_fixation_duration_information: false
94
+ use_in_projection_bias: false
95
+ use_lr_warmup: true
96
+ use_pupil_size_information: false
97
+ use_reduce_on_plateau: false
98
+ use_start_time_as_input_col: false
99
+ use_training_steps_for_end_and_lr_decay: true
100
+ use_words_coords: false
101
+ warmup_exponent: 1
102
+ weight_decay: 0.0
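The eight BERT_fin_exp_*.yaml files above hold the cfg dictionaries consumed by the model's __init__ in models.py (keys such as num_classes, loss_function and model_to_use map directly onto the cfg[...] look-ups there). A minimal sketch of reading one back, using values visible in the last file above:

import yaml

with open("models/BERT_fin_exp_20240122-194041.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model_to_use"], cfg["loss_function"], cfg["num_classes"])  # BERT corn_loss 16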
multi_proc_funcs.py ADDED
@@ -0,0 +1,2415 @@
1
+ from icecream import ic
2
+ from matplotlib import pyplot as plt
3
+ import pathlib as pl
4
+ import json
5
+ from PIL import Image
6
+ from torch.utils.data.dataloader import DataLoader as dl
7
+ import matplotlib.patches as patches
8
+ from torch.utils.data import Dataset as torch_dset
9
+ import torchvision.transforms.functional as tvfunc
10
+ import einops as eo
11
+ from collections.abc import Iterable
12
+ import numpy as np
13
+ import pandas as pd
14
+ from matplotlib import font_manager
15
+ from matplotlib.font_manager import FontProperties
16
+ from matplotlib.patches import Rectangle
17
+ from tqdm.auto import tqdm
18
+ import torch as t
19
+ import plotly.express as px
20
+ import copy
21
+
22
+ import yaml
23
+ import classic_correction_algos as calgo
24
+ import analysis_funcs as anf
25
+ import models
26
+ import popEye_funcs as pf
27
+ from loss_functions import corn_label_from_logits
28
+ import torch.multiprocessing
29
+ torch.multiprocessing.set_sharing_strategy('file_system') # Needed to make multi proc not fail on linux
30
+
31
+ ic.configureOutput(includeContext=True)
32
+
33
+ PLOTS_FOLDER = pl.Path("plots")
34
+ event_strs = [
35
+ "EFIX",
36
+ "EFIX R",
37
+ "EFIX L",
38
+ "SSACC",
39
+ "ESACC",
40
+ "SFIX",
41
+ "MSG",
42
+ "SBLINK",
43
+ "EBLINK",
44
+ "BUTTON",
45
+ "INPUT",
46
+ "END",
47
+ "START",
48
+ "DISPLAY ON",
49
+ ]
50
+ AVAILABLE_FONTS = [x.name for x in font_manager.fontManager.ttflist]
51
+ COLORS = px.colors.qualitative.Alphabet
52
+ RESULTS_FOLDER = pl.Path("results")
53
+ PLOTS_FOLDER = pl.Path("plots")
54
+
55
+ DIST_MODELS_FOLDER = pl.Path("models")
56
+ IMAGENET_MEAN = [0.485, 0.456, 0.406]
57
+ IMAGENET_STD = [0.229, 0.224, 0.225]
58
+ DEFAULT_FIX_MEASURES = [
59
+ "letternum",
60
+ "letter",
61
+ "on_word_number",
62
+ "on_word",
63
+ "on_sentence",
64
+ "num_words_in_sentence",
65
+ "on_sentence_num",
66
+ "word_land",
67
+ "line_let",
68
+ "line_word",
69
+ "sac_in",
70
+ "sac_out",
71
+ "word_launch",
72
+ "word_refix",
73
+ "word_reg_in",
74
+ "word_reg_out",
75
+ "sentence_reg_in",
76
+ "word_firstskip",
77
+ "word_run",
78
+ "sentence_run",
79
+ "word_run_fix",
80
+ "word_cland",
81
+ ]
82
+ ALL_FIX_MEASURES = DEFAULT_FIX_MEASURES + [
83
+ "angle_incoming",
84
+ "angle_outgoing",
85
+ "line_let_from_last_letter",
86
+ "sentence_word",
87
+ "line_let_previous",
88
+ "line_let_next",
89
+ "sentence_refix",
90
+ "word_reg_out_to",
91
+ "word_reg_in_from",
92
+ "sentence_reg_out",
93
+ "sentence_reg_in_from",
94
+ "sentence_reg_out_to",
95
+ "sentence_firstskip",
96
+ "word_runid",
97
+ "sentence_runid",
98
+ "word_fix",
99
+ "sentence_fix",
100
+ "sentence_run_fix",
101
+ ]
102
+
103
+
104
+ class DSet(torch_dset):
105
+ def __init__(
106
+ self,
107
+ in_sequence: t.Tensor,
108
+ chars_center_coords_padded: t.Tensor,
109
+ out_categories: t.Tensor,
110
+ trialslist: list,
111
+ padding_list: list = None,
112
+ padding_at_end: bool = False,
113
+ return_images_for_conv: bool = False,
114
+ im_partial_string: str = "fixations_chars_channel_sep",
115
+ input_im_shape=[224, 224],
116
+ ) -> None:
117
+ super().__init__()
118
+
119
+ self.in_sequence = in_sequence
120
+ self.chars_center_coords_padded = chars_center_coords_padded
121
+ self.out_categories = out_categories
122
+ self.padding_list = padding_list
123
+ self.padding_at_end = padding_at_end
124
+ self.trialslist = trialslist
125
+ self.return_images_for_conv = return_images_for_conv
126
+ self.input_im_shape = input_im_shape
127
+ if return_images_for_conv:
128
+ self.im_partial_string = im_partial_string
129
+ self.plot_files = [
130
+ str(x["plot_file"]).replace("fixations_words", im_partial_string) for x in self.trialslist
131
+ ]
132
+
133
+ def __getitem__(self, index):
134
+
135
+ if self.return_images_for_conv:
136
+ im = Image.open(self.plot_files[index])
137
+ if [im.size[1], im.size[0]] != self.input_im_shape:
138
+ im = tvfunc.resize(im, self.input_im_shape)
139
+ im = tvfunc.normalize(tvfunc.to_tensor(im), IMAGENET_MEAN, IMAGENET_STD)
140
+ if self.chars_center_coords_padded is not None:
141
+ if self.padding_list is not None:
142
+ attention_mask = t.ones(self.in_sequence[index].shape[:-1], dtype=t.long)
143
+ if self.padding_at_end:
144
+ if self.padding_list[index] > 0:
145
+ attention_mask[-self.padding_list[index] :] = 0
146
+ else:
147
+ attention_mask[: self.padding_list[index]] = 0
148
+ if self.return_images_for_conv:
149
+ return (
150
+ self.in_sequence[index],
151
+ self.chars_center_coords_padded[index],
152
+ im,
153
+ attention_mask,
154
+ self.out_categories[index],
155
+ )
156
+ return (
157
+ self.in_sequence[index],
158
+ self.chars_center_coords_padded[index],
159
+ attention_mask,
160
+ self.out_categories[index],
161
+ )
162
+ else:
163
+ if self.return_images_for_conv:
164
+ return (
165
+ self.in_sequence[index],
166
+ self.chars_center_coords_padded[index],
167
+ im,
168
+ self.out_categories[index],
169
+ )
170
+ else:
171
+ return (self.in_sequence[index], self.chars_center_coords_padded[index], self.out_categories[index])
172
+
173
+ if self.padding_list is not None:
174
+ attention_mask = t.ones(self.in_sequence[index].shape[:-1], dtype=t.long)
175
+ if self.padding_at_end:
176
+ if self.padding_list[index] > 0:
177
+ attention_mask[-self.padding_list[index] :] = 0
178
+ else:
179
+ attention_mask[: self.padding_list[index]] = 0
180
+ if self.return_images_for_conv:
181
+ return (self.in_sequence[index], im, attention_mask, self.out_categories[index])
182
+ else:
183
+ return (self.in_sequence[index], attention_mask, self.out_categories[index])
184
+ if self.return_images_for_conv:
185
+ return (self.in_sequence[index], im, self.out_categories[index])
186
+ else:
187
+ return (self.in_sequence[index], self.out_categories[index])
188
+
189
+ def __len__(self):
190
+ if isinstance(self.in_sequence, t.Tensor):
191
+ return self.in_sequence.shape[0]
192
+ else:
193
+ return len(self.in_sequence)
194
+
195
+
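+ # Minimal usage sketch for the DSet dataset defined above (illustrative only; the
+ # tensors, trial dicts and padding list are assumed to come from the preprocessing
+ # functions further below):
+ # dset = DSet(in_sequence, chars_center_coords_padded, out_categories, trialslist, padding_list)
+ # loader = dl(dset, batch_size=32, shuffle=False)
+ # in_seq, char_coords, attention_mask, targets = next(iter(loader))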
196
+ def remove_compile_from_model(model):
197
+ if hasattr(model.project, "_orig_mod"):
198
+ model.project = model.project._orig_mod
199
+ model.chars_conv = model.chars_conv._orig_mod
200
+ model.chars_classifier = model.chars_classifier._orig_mod
201
+ model.layer_norm_in = model.layer_norm_in._orig_mod
202
+ model.bert_model = model.bert_model._orig_mod
203
+ model.linear = model.linear._orig_mod
204
+ return model
205
+
206
+
207
+ def remove_compile_from_dict(state_dict):
208
+ for key in list(state_dict.keys()):
209
+ newkey = key.replace("._orig_mod.", ".")
210
+ state_dict[newkey] = state_dict.pop(key)
211
+ return state_dict
212
+
213
+
214
+ def load_model(model_file, cfg):
215
+ try:
216
+ model_loaded = t.load(model_file, map_location="cpu", weights_only=True)
217
+ if "hyper_parameters" in model_loaded.keys():
218
+ model_cfg_temp = model_loaded["hyper_parameters"]["cfg"]
219
+ else:
220
+ model_cfg_temp = cfg
221
+ model_state_dict = model_loaded["state_dict"]
222
+ except Exception as e:
223
+ ic(e)
224
+ ic(f"Failed to load {model_file}")
225
+ return None
226
+ model = models.LitModel(
227
+ [1, 500, 3],
228
+ model_cfg_temp["hidden_dim_bert"],
229
+ model_cfg_temp["num_attention_heads"],
230
+ model_cfg_temp["n_layers_BERT"],
231
+ model_cfg_temp["loss_function"],
232
+ 1e-4,
233
+ model_cfg_temp["weight_decay"],
234
+ model_cfg_temp,
235
+ model_cfg_temp["use_lr_warmup"],
236
+ model_cfg_temp["use_reduce_on_plateau"],
237
+ track_gradient_histogram=model_cfg_temp["track_gradient_histogram"],
238
+ register_forw_hook=model_cfg_temp["track_activations_via_hook"],
239
+ char_dims=model_cfg_temp["char_dims"],
240
+ )
241
+ model = remove_compile_from_model(model)
242
+ model_state_dict = remove_compile_from_dict(model_state_dict)
243
+ with t.no_grad():
244
+ model.load_state_dict(model_state_dict, strict=False)
245
+ model.eval()
246
+ model.freeze()
247
+ return model
248
+
249
+
250
+ def find_and_load_model(model_date: str):
251
+ model_cfg_file = list(DIST_MODELS_FOLDER.glob(f"*{model_date}*.yaml"))
252
+ if len(model_cfg_file) == 0:
253
+ ic(f"No model cfg yaml found for {model_date}")
254
+ return None, None
255
+ model_cfg_file = model_cfg_file[0]
256
+ with open(model_cfg_file) as f:
257
+ model_cfg = yaml.safe_load(f)
258
+
259
+ model_file = list(pl.Path("models").glob(f"*{model_date}*.ckpt"))[0]
260
+ model = load_model(model_file, model_cfg)
261
+
262
+ return model, model_cfg
263
+
264
+
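+ # Example use of find_and_load_model (the timestamp is an example; it must match a
+ # BERT_<timestamp>*.yaml / *.ckpt pair inside the models folder):
+ # model, model_cfg = find_and_load_model(model_date="20240104-223349")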
265
+ def set_up_models(dist_models_folder):
266
+ out_dict = {}
267
+ dist_models_with_norm = list(dist_models_folder.glob("*normalize_by_line_height_and_width_True*.ckpt"))
268
+ dist_models_without_norm = list(dist_models_folder.glob("*normalize_by_line_height_and_width_False*.ckpt"))
269
+ DIST_MODEL_DATE_WITH_NORM = dist_models_with_norm[0].stem.split("_")[1]
270
+
271
+ models_without_norm_df = [find_and_load_model(m_file.stem.split("_")[1]) for m_file in dist_models_without_norm]
272
+ models_with_norm_df = [find_and_load_model(m_file.stem.split("_")[1]) for m_file in dist_models_with_norm]
273
+
274
+ model_cfg_without_norm_df = [x[1] for x in models_without_norm_df if x[1] is not None][0]
275
+ model_cfg_with_norm_df = [x[1] for x in models_with_norm_df if x[1] is not None][0]
276
+
277
+ models_without_norm_df = [x[0] for x in models_without_norm_df if x[0] is not None]
278
+ models_with_norm_df = [x[0] for x in models_with_norm_df if x[0] is not None]
279
+
280
+ ensemble_model_avg = models.EnsembleModel(
281
+ models_without_norm_df, models_with_norm_df, learning_rate=0.0058, use_simple_average=True
282
+ )
283
+ out_dict["ensemble_model_avg"] = ensemble_model_avg
284
+
285
+ out_dict["model_cfg_without_norm_df"] = model_cfg_without_norm_df
286
+ out_dict["model_cfg_with_norm_df"] = model_cfg_with_norm_df
287
+
288
+ single_DIST_model, single_DIST_model_cfg = find_and_load_model(model_date=DIST_MODEL_DATE_WITH_NORM)
289
+ out_dict["single_DIST_model"] = single_DIST_model
290
+ out_dict["single_DIST_model_cfg"] = single_DIST_model_cfg
291
+ return out_dict
292
+
293
+
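+ # Sketch of how the models are assembled (assumes the models folder holds at least one
+ # checkpoint trained with and one without line-height/width normalisation):
+ # model_dict = set_up_models(DIST_MODELS_FOLDER)
+ # ensemble = model_dict["ensemble_model_avg"]
+ # single_model = model_dict["single_DIST_model"]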
294
+ def reorder_columns(
295
+ df,
296
+ cols=[
297
+ "subject",
298
+ "trial_id",
299
+ "item",
300
+ "condition",
301
+ "fixation_number",
302
+ "num",
303
+ "word_number",
304
+ "sentence_number",
305
+ "duration",
306
+ "start_uncorrected",
307
+ "stop_uncorrected",
308
+ "start_time",
309
+ "end_time",
310
+ "corrected_start_time",
311
+ "corrected_end_time",
312
+ "dX",
313
+ "dY",
314
+ ],
315
+ ):
316
+ existing_cols = [col for col in cols if col in df.columns]
317
+ other_cols = [col for col in df.columns if col not in cols]
318
+ return df[existing_cols + other_cols]
319
+
320
+
321
+ def nan_or_int_minus_one(x):
322
+ if not pd.isna(x):
323
+ return int(x - 1.0)
324
+ else:
325
+ return pd.NA
326
+
327
+
328
+ def add_popEye_cols_to_chars_df(chars_df):
329
+
330
+ if "letternum" not in chars_df.columns or "letline" not in chars_df.columns:
331
+ chars_df.reset_index(drop=False, inplace=True)
332
+ chars_df.rename({"index": "letternum"}, axis=1, inplace=True)
333
+ chars_df.loc[:, "letline"] = -1
334
+ chars_df["wordline"] = (
335
+ chars_df.groupby("assigned_line")["in_word_number"].rank(method="dense").map(nan_or_int_minus_one)
336
+ )
337
+ chars_df["wordsent"] = (
338
+ chars_df.groupby("in_sentence_number")["in_word_number"].rank(method="dense").map(nan_or_int_minus_one)
339
+ )
340
+ chars_df["letword"] = (
341
+ chars_df.groupby("in_word_number")["letternum"].rank(method="dense").map(nan_or_int_minus_one)
342
+ )
343
+ for line_idx in chars_df.assigned_line.unique():
344
+ chars_df.loc[chars_df.assigned_line == line_idx, "letline"] = (
345
+ chars_df.loc[chars_df.assigned_line == line_idx, "char"].reset_index().index
346
+ )
347
+ return chars_df
348
+
349
+
350
+ def add_boxes_to_ax(
351
+ chars_list,
352
+ ax,
353
+ font_to_use="DejaVu Sans Mono",
354
+ fontsize=21,
355
+ prefix="char",
356
+ box_annotations: list = None,
357
+ edgecolor="grey",
358
+ linewidth=0.8,
359
+ ):
360
+ if box_annotations is None:
361
+ enum = chars_list
362
+ else:
363
+ enum = zip(chars_list, box_annotations)
364
+ for v in enum:
365
+ if box_annotations is not None:
366
+ v, annot_text = v
367
+ x0, y0 = v[f"{prefix}_xmin"], v[f"{prefix}_ymin"]
368
+ xdiff, ydiff = v[f"{prefix}_xmax"] - v[f"{prefix}_xmin"], v[f"{prefix}_ymax"] - v[f"{prefix}_ymin"]
369
+ ax.add_patch(Rectangle((x0, y0), xdiff, ydiff, edgecolor=edgecolor, facecolor="none", lw=linewidth, alpha=0.4))
370
+ if box_annotations is not None:
371
+ ax.annotate(
372
+ str(annot_text),
373
+ (x0 + xdiff / 2, y0),
374
+ horizontalalignment="center",
375
+ verticalalignment="center",
376
+ fontproperties=FontProperties(family=font_to_use, style="normal", size=fontsize / 1.5),
377
+ )
378
+
379
+
380
+ def add_text_to_ax(
381
+ chars_list,
382
+ ax,
383
+ font_to_use="DejaVu Sans Mono",
384
+ fontsize=21,
385
+ prefix="char",
386
+ ):
387
+ font_props = FontProperties(family=font_to_use, style="normal", size=fontsize)
388
+ enum = chars_list
389
+ for v in enum:
390
+ ax.text(
391
+ v[f"{prefix}_x_center"],
392
+ v[f"{prefix}_y_center"],
393
+ v[prefix],
394
+ horizontalalignment="center",
395
+ verticalalignment="center",
396
+ fontproperties=font_props,
397
+ )
398
+
399
+
400
+ def set_font_from_chars_list(trial):
401
+
402
+ if "chars_list" in trial:
403
+ chars_df = pd.DataFrame(trial["chars_list"])
404
+ line_diffs = np.diff(chars_df.char_y_center.unique())
405
+ y_diffs = np.unique(line_diffs)
406
+ if len(y_diffs) == 1:
407
+ y_diff = y_diffs[0]
408
+ else:
409
+ y_diff = np.min(y_diffs)
410
+ y_diff = round(y_diff * 2) / 2
411
+
412
+ else:
413
+ y_diff = 1 / 0.333 * 18
414
+ font_size = y_diff * 0.333 # pixel to point conversion
415
+ return round((font_size) * 4, ndigits=0) / 4
416
+
417
+
418
+ def get_plot_props(trial, available_fonts):
419
+ if "font" in trial.keys():
420
+ font = trial["font"]
421
+ font_size = trial["font_size"]
422
+ if font not in available_fonts:
423
+ font = "DejaVu Sans Mono"
424
+ else:
425
+ font = "DejaVu Sans Mono"
426
+ font_size = 21
427
+ dpi = 96
428
+ if "display_coords" in trial.keys() and trial["display_coords"] is not None:
429
+ screen_res = (trial["display_coords"][2], trial["display_coords"][3])
430
+ else:
431
+ screen_res = (1920, 1080)
432
+ return font, font_size, dpi, screen_res
433
+
434
+
435
+ def get_font_and_font_size_from_trial(trial):
436
+ font_face, font_size, dpi, screen_res = get_plot_props(trial, AVAILABLE_FONTS)
437
+
438
+ if font_size is None and "font_size" in trial:
439
+ font_size = trial["font_size"]
440
+ elif font_size is None:
441
+ font_size = set_font_from_chars_list(trial)
442
+ return font_face, font_size
443
+
444
+
445
+ def sigmoid(x):
446
+ return 1 / (1 + np.exp(-1 * x))
447
+
448
+
449
+ def matplotlib_plot_df(
450
+ dffix,
451
+ trial,
452
+ algo_choice,
453
+ dffix_no_clean=None,
454
+ desired_dpi=300,
455
+ fix_to_plot=[],
456
+ stim_info_to_plot=["Characters", "Word boxes"],
457
+ box_annotations: list = None,
458
+ font=None,
459
+ use_duration_arrow_sizes=True,
460
+ ):
461
+ chars_df = pd.DataFrame(trial["chars_list"]) if "chars_list" in trial else None
462
+
463
+ if chars_df is not None:
464
+ font_face, font_size = get_font_and_font_size_from_trial(trial)
465
+ font_size = font_size * 0.65
466
+ else:
467
+ ic("No character or word information available to plot")
468
+
469
+ if "display_coords" in trial:
470
+ desired_width_in_pixels = trial["display_coords"][2] + 1
471
+ desired_height_in_pixels = trial["display_coords"][3] + 1
472
+ else:
473
+ desired_width_in_pixels = 1920
474
+ desired_height_in_pixels = 1080
475
+
476
+ figure_width = desired_width_in_pixels / desired_dpi
477
+ figure_height = desired_height_in_pixels / desired_dpi
478
+
479
+ fig = plt.figure(figsize=(figure_width, figure_height), dpi=desired_dpi)
480
+ ax = fig.add_subplot(1, 1, 1)
481
+ fig.subplots_adjust(bottom=0)
482
+ fig.subplots_adjust(top=1)
483
+ fig.subplots_adjust(right=1)
484
+ fig.subplots_adjust(left=0)
485
+ if font is None:
486
+ if "font" in trial and trial["font"] in AVAILABLE_FONTS:
487
+ font_to_use = trial["font"]
488
+ else:
489
+ font_to_use = "DejaVu Sans Mono"
490
+ else:
491
+ font_to_use = font
492
+ if "font_size" in trial:
493
+ font_size = trial["font_size"]
494
+ else:
495
+ font_size = 20
496
+
497
+ if "Words" in stim_info_to_plot and "words_list" in trial:
498
+ add_text_to_ax(
499
+ trial["words_list"],
500
+ ax,
501
+ font_to_use,
502
+ prefix="word",
503
+ fontsize=font_size / 3.89,
504
+ )
505
+ if "Word boxes" in stim_info_to_plot and "words_list" in trial:
506
+ add_boxes_to_ax(
507
+ trial["words_list"],
508
+ ax,
509
+ font_to_use,
510
+ prefix="word",
511
+ fontsize=font_size / 3.89,
512
+ box_annotations=box_annotations,
513
+ edgecolor="black",
514
+ linewidth=0.9,
515
+ )
516
+
517
+ if "Characters" in stim_info_to_plot and "chars_list" in trial:
518
+ add_text_to_ax(
519
+ trial["chars_list"],
520
+ ax,
521
+ font_to_use,
522
+ prefix="char",
523
+ fontsize=font_size / 3.89,
524
+ )
525
+ if "Character boxes" in stim_info_to_plot and "chars_list" in trial:
526
+ add_boxes_to_ax(
527
+ trial["chars_list"],
528
+ ax,
529
+ font_to_use,
530
+ prefix="char",
531
+ fontsize=font_size / 3.89,
532
+ box_annotations=box_annotations,
533
+ )
534
+
535
+ if "Uncorrected Fixations" in fix_to_plot and dffix_no_clean is None:
536
+ if use_duration_arrow_sizes and "duration" in dffix.columns:
537
+ duration_scaled = dffix.duration - dffix.duration.min()
538
+ duration_scaled = (((duration_scaled / duration_scaled.max()) - 0.5) * 3).values
539
+ durations = sigmoid(duration_scaled) * 50 * 0.5
540
+ if use_duration_arrow_sizes and "duration" in dffix.columns:
541
+ ax.plot(
542
+ dffix.x,
543
+ dffix.y,
544
+ label="Raw fixations",
545
+ color="blue",
546
+ alpha=0.5,
547
+ )
548
+ add_arrow_annotations(dffix, "y", ax, "blue", durations[:-1])
549
+ else:
550
+ ax.plot(
551
+ dffix.x,
552
+ dffix.y,
553
+ label="Remaining fixations",
554
+ color="blue",
555
+ alpha=0.5,
556
+ )
557
+ add_arrow_annotations(dffix, "y", ax, "blue", 4)
558
+
559
+ if dffix_no_clean is not None and "Uncorrected Fixations" in fix_to_plot:
560
+
561
+ ax.plot(
562
+ dffix_no_clean.x,
563
+ dffix_no_clean.y,
564
+ # marker='.',
565
+ label="All fixations",
566
+ color="k",
567
+ alpha=0.5,
568
+ lw=1,
569
+ )
570
+ add_arrow_annotations(dffix_no_clean, "y", ax, "k", 4)
571
+ if "was_discarded_due_blinks" in dffix_no_clean.columns and dffix_no_clean["was_discarded_due_blinks"].any():
572
+ discarded_blink_fix = dffix_no_clean.loc[dffix_no_clean["was_discarded_due_blinks"], :].copy()
573
+ ax.scatter(
574
+ discarded_blink_fix.x,
575
+ discarded_blink_fix.y,
576
+ s=12,
577
+ label="Discarded due to blinks",
578
+ lw=1.5,
579
+ edgecolors="orange",
580
+ facecolors="none",
581
+ )
582
+ if (
583
+ "was_discarded_due_to_long_duration" in dffix_no_clean.columns
584
+ and dffix_no_clean["was_discarded_due_to_long_duration"].any()
585
+ ):
586
+ discarded_long_fix = dffix_no_clean.loc[dffix_no_clean["was_discarded_due_to_long_duration"], :].copy()
587
+ ax.scatter(
588
+ discarded_long_fix.x,
589
+ discarded_long_fix.y,
590
+ s=18,
591
+ label="Overly long fixations",
592
+ lw=0.8,
593
+ edgecolors="purple",
594
+ facecolors="none",
595
+ )
596
+ if "was_merged" in dffix_no_clean.columns:
597
+ merged_fix = dffix_no_clean.loc[dffix_no_clean["was_merged"], :].copy()
598
+ if not merged_fix.empty:
599
+ ax.scatter(
600
+ merged_fix.x,
601
+ merged_fix.y,
602
+ s=7,
603
+ label="Merged short fixations",
604
+ lw=1,
605
+ edgecolors="red",
606
+ facecolors="none",
607
+ )
608
+ if "was_discarded_outside_text" in dffix_no_clean.columns:
609
+ was_discarded_outside_text_fix = dffix_no_clean.loc[dffix_no_clean["was_discarded_outside_text"], :].copy()
610
+ if not was_discarded_outside_text_fix.empty:
611
+ ax.scatter(
612
+ was_discarded_outside_text_fix.x,
613
+ was_discarded_outside_text_fix.y,
614
+ s=8,
615
+ label="Outside text fixations",
616
+ lw=1.2,
617
+ edgecolors="blue",
618
+ facecolors="none",
619
+ )
620
+ if "was_discarded_short_fix" in dffix_no_clean.columns:
621
+ was_discarded_short_fix_fix = dffix_no_clean.loc[dffix_no_clean["was_discarded_short_fix"], :].copy()
622
+ if not was_discarded_short_fix_fix.empty:
623
+ ax.scatter(
624
+ was_discarded_short_fix_fix.x,
625
+ was_discarded_short_fix_fix.y,
626
+ label="Discarded short fixations",
627
+ s=9,
628
+ lw=1.5,
629
+ edgecolors="green",
630
+ facecolors="none",
631
+ )
632
+ if "Corrected Fixations" in fix_to_plot:
633
+ if isinstance(algo_choice, list):
634
+ algo_choices = algo_choice
635
+ repeats = range(len(algo_choice))
636
+ else:
637
+ algo_choices = [algo_choice]
638
+ repeats = range(1)
639
+ for algoIdx in repeats:
640
+ algo_choice = algo_choices[algoIdx]
641
+ if f"y_{algo_choice}" in dffix.columns:
642
+ ax.plot(
643
+ dffix.x,
644
+ dffix.loc[:, f"y_{algo_choice}"],
645
+ label=algo_choice,
646
+ color=COLORS[algoIdx],
647
+ alpha=0.6,
648
+ linewidth=0.6,
649
+ )
650
+
651
+ add_arrow_annotations(dffix, f"y_{algo_choice}", ax, COLORS[algoIdx], 6)
652
+
653
+ ax.set_xlim((0, desired_width_in_pixels))
654
+ ax.set_ylim((0, desired_height_in_pixels))
655
+ ax.invert_yaxis()
656
+ if "Corrected Fixations" in fix_to_plot or "Uncorrected Fixations" in fix_to_plot:
657
+ ax.legend(prop={"size": 5})
658
+
659
+ return fig, desired_width_in_pixels, desired_height_in_pixels
660
+
661
+
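+ # Illustrative call to matplotlib_plot_df: plotting corrected fixations requires that
+ # dffix already contains a y_<algo_choice> column produced by one of the correction
+ # algorithms, e.g.:
+ # fig, width_px, height_px = matplotlib_plot_df(
+ #     dffix, trial, algo_choice, fix_to_plot=["Corrected Fixations"],
+ #     stim_info_to_plot=["Characters", "Word boxes"],
+ # )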
662
+ def add_arrow_annotations(dffix, y_col, ax, color, size):
663
+ x = dffix.x.values
664
+
665
+ y = dffix.loc[:, y_col].values
666
+
667
+ x = x[:-1]
668
+ y = y[:-1]
669
+ dX = -(x[1:] - x[:-1])
670
+ dY = -(y[1:] - y[:-1])
671
+
672
+ xpos = x[1:]
673
+ ypos = y[1:]
674
+ if isinstance(size, Iterable):
675
+ use_size_idx = True
676
+ else:
677
+ use_size_idx = False
678
+ s = size
679
+ for fidx, (X, Y, dX, dY) in enumerate(zip(xpos, ypos, dX, dY)):
680
+ if use_size_idx:
681
+ s = size[fidx]
682
+ ax.annotate(
683
+ "",
684
+ xytext=(X + 0.001 * dX, Y + 0.001 * dY),
685
+ xy=(X, Y),
686
+ arrowprops=dict(arrowstyle="fancy", color=color),
687
+ size=s,
688
+ alpha=0.3,
689
+ )
690
+
691
+
692
+ def plot_saccade_df(fix_df, sac_df, trial, show_numbers=False, add_lines_to_fix_df=False):
693
+ stim_only_fig, _, _ = matplotlib_plot_df(
694
+ fix_df,
695
+ trial,
696
+ None,
697
+ dffix_no_clean=None,
698
+ desired_dpi=300,
699
+ fix_to_plot=[],
700
+ stim_info_to_plot=["Characters", "Word boxes"],
701
+ box_annotations=None,
702
+ font=None,
703
+ )
704
+ if stim_only_fig is None:
705
+ fig, ax = plt.subplots(1, 1, figsize=(8, 5), dpi=150)
706
+ invert_ax_needed = True
707
+ else:
708
+ fig = stim_only_fig
709
+ ax = fig.axes[0]
710
+ invert_ax_needed = False
711
+
712
+ def plot_arrow(x1, y1, x2, y2, scale_factor):
713
+ """Plot an arrow from (x1,y1) to (x2,y2) with adjustable size"""
714
+ ax.arrow(
715
+ x1,
716
+ y1,
717
+ (x2 - x1),
718
+ (y2 - y1),
719
+ color="k",
720
+ alpha=0.7,
721
+ length_includes_head=True,
722
+ width=3 * scale_factor,
723
+ head_width=15 * scale_factor,
724
+ head_length=15 * scale_factor,
725
+ )
726
+
727
+ xs = sac_df["xs"].values
728
+ ys = sac_df["ys"].values
729
+ xe = sac_df["xe"].values
730
+ ye = sac_df["ye"].values
731
+ extent = np.sqrt((xs.min() - xe.max()) ** 2 + (ys.min() - ye.max()) ** 2)
732
+ scale_factor = 0.0005 * extent
733
+ for i in range(len(xs)):
734
+ plot_arrow(xs[i], ys[i], xe[i], ye[i], scale_factor=scale_factor)
735
+ if add_lines_to_fix_df:
736
+ plotfunc = ax.plot
737
+ else:
738
+ plotfunc = ax.scatter
739
+ if "x" in fix_df.columns:
740
+ plotfunc(fix_df["x"], fix_df["y"], marker=".")
741
+ else:
742
+ plotfunc(fix_df["xs"], fix_df["ys"], marker=".")
743
+
744
+ if invert_ax_needed:
745
+ ax.invert_yaxis()
746
+ if show_numbers:
747
+ size = 8 * scale_factor
748
+
749
+ xytext = (
750
+ 1,
751
+ -1,
752
+ )
753
+ for index, row in fix_df.iterrows():
754
+ ax.annotate(
755
+ index,
756
+ xy=(row["x"], row["y"]),
757
+ textcoords="offset points",
758
+ ha="center",
759
+ xytext=xytext,
760
+ va="bottom",
761
+ color="k",
762
+ size=size,
763
+ )
764
+
765
+ for index, row in sac_df.iterrows():
766
+ ax.annotate(
767
+ index,
768
+ xy=(row["xs"], row["ys"]),
769
+ textcoords="offset points",
770
+ ha="center",
771
+ xytext=xytext,
772
+ va="top",
773
+ color="r",
774
+ size=size,
775
+ )
776
+ return fig
777
+
778
+
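+ # Illustrative use of plot_saccade_df with the long-format events dataframe built below
+ # (fixations and saccades are split by their "msg" label; xs/ys/xe/ye hold saccade start
+ # and end coordinates):
+ # fix_df = events_df[events_df["msg"] == "FIX"]
+ # sac_df = events_df[events_df["msg"] == "SAC"]
+ # fig = plot_saccade_df(fix_df, sac_df, trial, show_numbers=True)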
779
+ def get_events_df_from_lines_and_trial_selection(trial, trial_lines, discard_fixations_without_sfix):
780
+
781
+ line_dicts = []
782
+ fixations_dicts = []
783
+ events_dicts = []
784
+ blink_started = False
785
+
786
+ fixation_started = False
787
+ esac_count = 0
788
+ efix_count = 0
789
+ sfix_count = 0
790
+ sblink_count = 0
791
+ eblink_times = []
792
+
793
+ eye_to_use = "R"
794
+ for l in trial_lines:
795
+ if "EFIX R" in l:
796
+ eye_to_use = "R"
797
+ break
798
+ elif "EFIX L" in l:
799
+ eye_to_use = "L"
800
+ break
801
+ for l in trial_lines:
802
+ parts = [x.strip() for x in l.split("\t")]
803
+ if f"EFIX {eye_to_use}" in l:
804
+ efix_count += 1
805
+ if fixation_started:
806
+ had_SFIX_before_it = True
807
+ if parts[1] == "." and parts[2] == ".":
808
+ continue
809
+ fixation_started = False
810
+ else:
811
+ had_SFIX_before_it = False
812
+ fix_dict = {
813
+ "fixation_number": efix_count,
814
+ "start_time": float(pd.to_numeric(parts[0].split()[-1].strip(), errors="coerce")),
815
+ "end_time": float(pd.to_numeric(parts[1].strip(), errors="coerce")),
816
+ "duration": float(pd.to_numeric(parts[2].strip(), errors="coerce")),
817
+ "x": float(pd.to_numeric(parts[3].strip(), errors="coerce")),
818
+ "y": float(pd.to_numeric(parts[4].strip(), errors="coerce")),
819
+ "pupil_size": float(pd.to_numeric(parts[5].strip(), errors="coerce")),
820
+ "had_SFIX_before_it": had_SFIX_before_it,
821
+ "msg": "FIX",
822
+ }
823
+ if not discard_fixations_without_sfix or had_SFIX_before_it:
824
+ fixations_dicts.append(fix_dict)
825
+ events_dicts.append(
826
+ {
827
+ "num": efix_count - 1,
828
+ "start": float(pd.to_numeric(parts[0].split()[-1].strip(), errors="coerce")),
829
+ "stop": float(pd.to_numeric(parts[1].strip(), errors="coerce")),
830
+ "duration": float(pd.to_numeric(parts[2].strip(), errors="coerce")),
831
+ "xs": float(pd.to_numeric(parts[3].strip(), errors="coerce")),
832
+ "xe": None,
833
+ "ys": float(pd.to_numeric(parts[4].strip(), errors="coerce")),
834
+ "ye": None,
835
+ "ampl": None,
836
+ "pv": None,
837
+ "pupil_size": float(pd.to_numeric(parts[5].strip(), errors="coerce")),
838
+ "msg": "FIX",
839
+ }
840
+ )
841
+ if len(fixations_dicts) >= 2:
842
+ assert fixations_dicts[-1]["start_time"] > fixations_dicts[-2]["start_time"], "start times not in order"
843
+ elif f"SFIX {eye_to_use}" in l:
844
+ sfix_count += 1
845
+ fixation_started = True
846
+ elif f"SBLINK {eye_to_use}" in l:
847
+ sblink_count += 1
848
+ blink_started = True
849
+ elif f"EBLINK {eye_to_use}" in l:
850
+ blink_started = False
851
+ blink_dict = {
852
+ "num": len(eblink_times),
853
+ "start": float(pd.to_numeric(parts[0].split()[-1].strip(), errors="coerce")),
854
+ "stop": float(pd.to_numeric(parts[1].strip(), errors="coerce")),
855
+ "duration": float(pd.to_numeric(parts[2].strip(), errors="coerce")),
856
+ "xs": None,
857
+ "xe": None,
858
+ "ys": None,
859
+ "ye": None,
860
+ "ampl": None,
861
+ "pv": None,
862
+ "pupil_size": None,
863
+ "msg": "BLINK",
864
+ }
865
+ events_dicts.append(blink_dict)
866
+ eblink_times.append(float(pd.to_numeric(parts[-1], errors="coerce")))
867
+ elif "ESACC" in l:
868
+ sac_dict = {
869
+ "num": esac_count,
870
+ "start": float(pd.to_numeric(parts[0].split()[-1].strip(), errors="coerce")),
871
+ "stop": float(pd.to_numeric(parts[1].strip(), errors="coerce")),
872
+ "duration": float(pd.to_numeric(parts[2].strip(), errors="coerce")),
873
+ "xs": float(pd.to_numeric(parts[3].strip(), errors="coerce")),
874
+ "ys": float(pd.to_numeric(parts[4].strip(), errors="coerce")),
875
+ "xe": float(pd.to_numeric(parts[5].strip(), errors="coerce")),
876
+ "ye": float(pd.to_numeric(parts[6].strip(), errors="coerce")),
877
+ "ampl": float(pd.to_numeric(parts[7].strip(), errors="coerce")),
878
+ "pv": float(pd.to_numeric(parts[8].strip(), errors="coerce")),
879
+ "pupil_size": None,
880
+ "msg": "SAC",
881
+ }
882
+ events_dicts.append(sac_dict)
883
+ esac_count += 1
884
+ if not blink_started and not any([True for x in event_strs if x in l]):
885
+ if len(parts) < 3 or (parts[1] == "." and parts[2] == "."):
886
+ continue
887
+ line_dicts.append(
888
+ {
889
+ "idx": float(pd.to_numeric(parts[0].strip(), errors="coerce")),
890
+ "x": float(pd.to_numeric(parts[1].strip(), errors="coerce")),
891
+ "y": float(pd.to_numeric(parts[2].strip(), errors="coerce")),
892
+ "p": float(pd.to_numeric(parts[3].strip(), errors="coerce")),
893
+ "part_of_fixation": fixation_started,
894
+ "fixation_number": sfix_count,
895
+ "part_of_blink": blink_started,
896
+ "blink_number": sblink_count,
897
+ }
898
+ )
899
+
900
+ trial["eblink_times"] = eblink_times
901
+ df = pd.DataFrame(line_dicts)
902
+ df["x_smoothed"] = np.convolve(df.x, np.ones((5,)) / 5, mode="same") # popEye smoothes this way
903
+ df["y_smoothed"] = np.convolve(df.y, np.ones((5,)) / 5, mode="same")
904
+ df["time"] = df["idx"] - df["idx"].iloc[0]
905
+ df = pf.compute_velocity(df)
906
+ events_df = pd.DataFrame(events_dicts)
907
+ events_df["start_uncorrected"] = events_df.start
908
+ events_df["stop_uncorrected"] = events_df.stop
909
+ events_df["start"] = events_df.start - trial["trial_start_time"]
910
+ events_df["stop"] = events_df.stop - trial["trial_start_time"]
911
+ events_df["start"] = events_df["start"].clip(0, events_df["start"].max())
912
+ events_df.sort_values(by="start", inplace=True) # Needed because blinks can happen during other events, I think
913
+ events_df.reset_index(drop=True, inplace=True)
914
+ events_df = pf.event_long(events_df)
915
+ events_df["duration"] = events_df["stop"] - events_df["start"]
916
+
917
+ trial["efix_count"] = efix_count
918
+ trial["eye_to_use"] = eye_to_use
919
+ trial["sfix_count"] = sfix_count
920
+ trial["sblink_count"] = sblink_count
921
+ return trial, df, events_df
922
+
923
+
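+ # get_events_df_from_lines_and_trial_selection (above) returns the updated trial dict,
+ # a raw gaze-sample dataframe with smoothed coordinates and velocity, and a long-format
+ # events dataframe in which every row is a parsed fixation ("FIX"), saccade ("SAC") or
+ # blink ("BLINK") event from the ASC trial lines.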
924
+ def add_default_font_and_character_props_to_state(trial):
925
+ chars_list = trial["chars_list"]
926
+ chars_df = pd.DataFrame(trial["chars_list"])
927
+ line_diffs = np.diff(chars_df.char_y_center.unique())
928
+ y_diffs = np.unique(line_diffs)
929
+ if len(y_diffs) > 1:
930
+ y_diff = np.min(y_diffs)
931
+ else:
932
+ y_diff = y_diffs[0]
933
+
934
+ y_diff = round(y_diff * 2) / 2
935
+ x_txt_start = chars_list[0]["char_xmin"]
936
+ y_txt_start = chars_list[0]["char_y_center"]
937
+
938
+ font_face, font_size = get_font_and_font_size_from_trial(trial)
939
+
940
+ line_height = y_diff
941
+ return y_diff, x_txt_start, y_txt_start, font_face, font_size, line_height
942
+
943
+
944
+ def get_raw_events_df_and_trial(trial, discard_fixations_without_sfix):
945
+ fname = pl.Path(trial["filename"]).stem
946
+ trial_id = trial["trial_id"]
947
+ trial_lines = trial.pop("trial_lines")
948
+
949
+ trial["plot_file"] = str(PLOTS_FOLDER.joinpath(f"{fname}_{trial_id}_2ndInput_chars_channel_sep.png"))
950
+
951
+ trial, df, events_df = get_events_df_from_lines_and_trial_selection(
952
+ trial, trial_lines, discard_fixations_without_sfix
953
+ )
954
+ trial["gaze_df"] = df
955
+ font, font_size, dpi, screen_res = get_plot_props(trial, AVAILABLE_FONTS)
956
+ trial["font"] = font
957
+ trial["font_size"] = font_size
958
+ trial["dpi"] = dpi
959
+ trial["screen_res"] = screen_res
960
+ if "chars_list" in trial:
961
+ chars_df = pd.DataFrame(trial["chars_list"])
962
+
963
+ chars_df = add_popEye_cols_to_chars_df(chars_df)
964
+
965
+ if "index" not in chars_df.columns:
966
+ chars_df.reset_index(inplace=True)
967
+ trial["chars_df"] = chars_df.to_dict()
968
+ trial["y_char_unique"] = list(chars_df.char_y_center.sort_values().unique())
969
+ return reorder_columns(events_df), trial
970
+
971
+
972
+ def get_outlier_indeces(
973
+ dffix, chars_df, x_thres_in_chars, y_thresh_in_heights, xcol, ycol, letter_width_avg, line_heights_avg
974
+ ):
975
+ indeces_out = []
976
+ for linenum, line_chars_subdf in chars_df.groupby("assigned_line"):
977
+ left = line_chars_subdf["char_xmin"].min()
978
+ right = line_chars_subdf["char_xmax"].max()
979
+ top = line_chars_subdf["char_ymin"].min()
980
+ bottom = line_chars_subdf["char_ymax"].max()
981
+ left_min = left - (x_thres_in_chars * letter_width_avg)
982
+ right_max = right + (x_thres_in_chars * letter_width_avg)
983
+ top_max = top - (line_heights_avg * y_thresh_in_heights)
984
+ bottom_min = bottom + (line_heights_avg * y_thresh_in_heights)
985
+ indeces_out_line = []
986
+ indeces_out_line.extend(list(dffix.loc[dffix[xcol] < left_min, :].index))
987
+ indeces_out_line.extend(list(dffix.loc[dffix[xcol] > right_max, :].index))
988
+ indeces_out_line.extend(list(dffix.loc[dffix[ycol] < top_max, :].index))
989
+ indeces_out_line.extend(list(dffix.loc[dffix[ycol] > bottom_min, :].index))
990
+ indeces_out_line_set = set(indeces_out_line)
991
+ indeces_out.append(indeces_out_line_set)
992
+ return list(set.intersection(*indeces_out))
993
+
994
+
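+ # Note on get_outlier_indeces above: the thresholds are expressed in average letter
+ # widths (x) and line heights (y), and because the per-line index sets are intersected,
+ # a fixation counts as an outlier only if it lies outside the margins of every text line.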
995
+ def get_distance_between_fixations_in_characters_and_recalc_duration(
996
+ fix, letter_width_avg, start_colname="start", stop_colname="stop", xcol="xs"
997
+ ):
998
+ fix.reset_index(drop=True, inplace=True)
999
+ fix.loc[:, "duration"] = fix[stop_colname] - fix[start_colname]
1000
+ fix.loc[:, "distance_in_char_widths"] = 0.0
1001
+ for i in range(1, len(fix)):
1002
+ fix.loc[i, "distance_in_char_widths"] = np.round(
1003
+ np.abs(fix.loc[i, xcol] - fix.loc[i - 1, xcol]) / letter_width_avg, decimals=3
1004
+ )
1005
+ return fix
1006
+
1007
+
1008
+ def clean_fixations_popeye_no_sacc(fix, trial, duration_threshold, distance_threshold):
1009
+ if "letter_width_avg" in trial:
1010
+ letter_width_avg = trial["letter_width_avg"]
1011
+ else:
1012
+ letter_width_avg = 12
1013
+
1014
+ stop_time_col, start_time_col = get_time_cols(fix)
1015
+ if "xs" in fix.columns:
1016
+ x_colname = "xs"
1017
+ y_colname = "ys"
1018
+ else:
1019
+ x_colname = "x"
1020
+ y_colname = "y"
1021
+ if "blink" not in fix.columns:
1022
+ fix["blink"] = 0
1023
+ fix.dropna(subset=[x_colname, y_colname], how="any", axis=0, inplace=True)
1024
+ fix.reset_index(drop=True, inplace=True)
1025
+ fix = get_distance_between_fixations_in_characters_and_recalc_duration(
1026
+ fix, letter_width_avg, start_time_col, stop_time_col, x_colname
1027
+ )
1028
+
1029
+ fix["num"] = np.arange(len(fix), dtype=int)
1030
+ i = 0
1031
+ while i <= len(fix) - 1:
1032
+
1033
+ merge_before = False
1034
+ merge_after = False
1035
+
1036
+ if fix["duration"].iloc[i] <= duration_threshold:
1037
+
1038
+ # check fixation n - 1
1039
+ if i > 1:
1040
+ if (
1041
+ fix["duration"].iloc[i - 1] > duration_threshold
1042
+ and fix["blink"].iloc[i - 1] == 0
1043
+ and fix["distance_in_char_widths"].iloc[i] <= distance_threshold
1044
+ ):
1045
+ merge_before = True
1046
+ # check fixation n + 1
1047
+ if i < len(fix) - 1:
1048
+ if (
1049
+ fix["duration"].iloc[i + 1] > duration_threshold
1050
+ and fix["blink"].iloc[i + 1] == 0
1051
+ and fix["distance_in_char_widths"].iloc[i + 1] <= distance_threshold
1052
+ ):
1053
+ merge_after = True
1054
+
1055
+ # check merge.status
1056
+ if merge_before and not merge_after:
1057
+ merge = -1
1058
+ elif not merge_before and merge_after:
1059
+ merge = 1
1060
+ elif not merge_before and not merge_after:
1061
+ merge = 0
1062
+ elif merge_before and merge_after:
1063
+ if fix["duration"].iloc[i - 1] >= fix["duration"].iloc[i + 1]:
1064
+ merge = -1
1065
+ else:
1066
+ merge = 1
1067
+
1068
+ # close if above duration threshold
1069
+ else:
1070
+ merge = 0
1071
+
1072
+ if merge == 0:
1073
+ i += 1
1074
+
1075
+ elif merge == -1:
1076
+
1077
+ fix.loc[i - 1, stop_time_col] = fix.loc[i, stop_time_col]
1078
+ fix.loc[i - 1, x_colname] = round((fix.loc[i - 1, x_colname] + fix.loc[i, x_colname]) / 2)
1079
+ fix.loc[i - 1, y_colname] = round((fix.loc[i - 1, y_colname] + fix.loc[i, y_colname]) / 2)
1080
+
1081
+ fix = fix.drop(i, axis=0)
1082
+ fix.reset_index(drop=True, inplace=True)
1083
+
1084
+ start = fix[start_time_col].iloc[i - 1]
1085
+ stop = fix[stop_time_col].iloc[i - 1]
1086
+
1087
+ fix = get_distance_between_fixations_in_characters_and_recalc_duration(
1088
+ fix, letter_width_avg, start_time_col, stop_time_col, x_colname
1089
+ )
1090
+
1091
+ elif merge == 1:
1092
+ fix.loc[i + 1, start_time_col] = fix.loc[i, start_time_col]
1093
+ fix.loc[i + 1, x_colname] = round((fix.loc[i, x_colname] + fix.loc[i + 1, x_colname]) / 2)
1094
+ fix.loc[i + 1, y_colname] = round((fix.loc[i, y_colname] + fix.loc[i + 1, y_colname]) / 2)
1095
+
1096
+ fix.drop(index=i, inplace=True)
1097
+ fix.reset_index(drop=True, inplace=True)
1098
+
1099
+ start = fix.loc[i, start_time_col]
1100
+ stop = fix.loc[i, stop_time_col]
1101
+
1102
+ fix = get_distance_between_fixations_in_characters_and_recalc_duration(
1103
+ fix, letter_width_avg, start_time_col, stop_time_col, x_colname
1104
+ )
1105
+
1106
+ fix.loc[:, "num"] = np.arange(len(fix), dtype=int)
1107
+
1108
+ # delete last fixation
1109
+ if fix.iloc[-1]["duration"] < duration_threshold:
1110
+ fix = fix.iloc[:-1]
1111
+ trial["last_fixation_was_discarded_because_too_short"] = True
1112
+ else:
1113
+ trial["last_fixation_was_discarded_because_too_short"] = False
1114
+ fix.reset_index(drop=True, inplace=True)
1115
+ return fix.copy()
1116
+
1117
+
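+ # Illustrative call to clean_fixations_popeye_no_sacc (threshold values are examples,
+ # not prescribed defaults): fixations shorter than duration_threshold ms are merged into
+ # a neighbouring fixation lying within distance_threshold average character widths.
+ # fix_clean = clean_fixations_popeye_no_sacc(fix_df, trial, duration_threshold=80, distance_threshold=1)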
1118
+ def clean_dffix_own(
1119
+ trial: dict,
1120
+ choice_handle_short_and_close_fix: str,
1121
+ discard_far_out_of_text_fix,
1122
+ x_thres_in_chars,
1123
+ y_thresh_in_heights,
1124
+ short_fix_threshold,
1125
+ merge_distance_threshold: float,
1126
+ discard_long_fix: bool,
1127
+ discard_long_fix_threshold: int,
1128
+ discard_blinks: bool,
1129
+ dffix: pd.DataFrame,
1130
+ ):
1131
+ dffix = dffix.dropna(how="all", axis=1).copy()
1132
+ if dffix.empty:
1133
+ return dffix, trial
1134
+ dffix = dffix.rename(
1135
+ {
1136
+ k: v
1137
+ for k, v in {
1138
+ "xs": "x",
1139
+ "ys": "y",
1140
+ "num": "fixation_number",
1141
+ }.items()
1142
+ if v not in dffix.columns
1143
+ },
1144
+ axis=1,
1145
+ )
1146
+ stop_time_col, start_time_col = get_time_cols(dffix)
1147
+ add_time_cols(dffix, stop_time_col, start_time_col)
1148
+ if "dffix_no_clean" not in trial:
1149
+ trial["dffix_no_clean"] = (
1150
+ dffix.copy()
1151
+ ) # TODO check if cleaning can be dialed in or if dffix get overwritten every time
1152
+ add_time_cols(trial["dffix_no_clean"], stop_time_col, start_time_col)
1153
+
1154
+ trial["dffix_no_clean"]["was_merged"] = False
1155
+ trial["dffix_no_clean"]["was_discarded_short_fix"] = False
1156
+ trial["dffix_no_clean"]["was_discarded_outside_text"] = False
1157
+
1158
+ num_fix_before_clean = trial["dffix_no_clean"].shape[0]
1159
+ trial["Fixation Cleaning Stats"] = {}
1160
+ trial["Fixation Cleaning Stats"]["Number of fixations before cleaning"] = num_fix_before_clean
1161
+
1162
+ trial["Fixation Cleaning Stats"]["Discard fixation before or after blinks"] = discard_blinks
1163
+
1164
+ if discard_blinks and "blink" in dffix.columns:
1165
+ trial["dffix_no_clean"]["was_discarded_due_blinks"] = False
1166
+ dffix = dffix[dffix["blink"] == False].copy()
1167
+ trial["dffix_no_clean"].loc[
1168
+ ~trial["dffix_no_clean"]["start_time"].isin(dffix["start_time"]), "was_discarded_due_blinks"
1169
+ ] = True
1170
+ trial["Fixation Cleaning Stats"]["Number of discarded fixations due to blinks"] = (
1171
+ num_fix_before_clean - dffix.shape[0]
1172
+ )
1173
+ trial["Fixation Cleaning Stats"]["Number of discarded fixations due to blinks (%)"] = round(
1174
+ 100
1175
+ * (trial["Fixation Cleaning Stats"]["Number of discarded fixations due to blinks"] / num_fix_before_clean),
1176
+ 2,
1177
+ )
1178
+
1179
+ trial["Fixation Cleaning Stats"]["Discard long fixations"] = discard_long_fix
1180
+
1181
+ if discard_long_fix and not dffix.empty:
1182
+ dffix_before_long_fix_removal = dffix.copy()
1183
+ trial["dffix_no_clean"]["was_discarded_due_to_long_duration"] = False
1184
+ dffix = dffix[dffix["duration"] < discard_long_fix_threshold].copy()
1185
+ dffix_after_long_fix_removal = dffix.copy()
1186
+ trial["dffix_no_clean"].loc[
1187
+ (
1188
+ ~trial["dffix_no_clean"]["start_time"].isin(dffix_after_long_fix_removal["start_time"])
1189
+ & (trial["dffix_no_clean"]["start_time"].isin(dffix_before_long_fix_removal["start_time"]))
1190
+ ),
1191
+ "was_discarded_due_to_long_duration",
1192
+ ] = True
1193
+ trial["Fixation Cleaning Stats"]["Number of discarded long fixations"] = num_fix_before_clean - dffix.shape[0]
1194
+ trial["Fixation Cleaning Stats"]["Number of discarded long fixations (%)"] = round(
1195
+ 100 * (trial["Fixation Cleaning Stats"]["Number of discarded long fixations"] / num_fix_before_clean), 2
1196
+ )
1197
+ num_fix_before_merge = dffix.shape[0]
1198
+ trial["Fixation Cleaning Stats"]["How short and close fixations were handled"] = choice_handle_short_and_close_fix
1199
+ if (
1200
+ choice_handle_short_and_close_fix == "Merge" or choice_handle_short_and_close_fix == "Merge then discard"
1201
+ ) and not dffix.empty:
1202
+ dffix_before_merge = dffix.copy()
1203
+ dffix = clean_fixations_popeye_no_sacc(dffix, trial, short_fix_threshold, merge_distance_threshold)
1204
+ dffix_after_merge = dffix.copy()
1205
+ trial["dffix_no_clean"].loc[
1206
+ (~trial["dffix_no_clean"]["start_time"].isin(dffix_after_merge["start_time"]))
1207
+ & (trial["dffix_no_clean"]["start_time"].isin(dffix_before_merge["start_time"])),
1208
+ "was_merged",
1209
+ ] = True
1210
+ if trial["last_fixation_was_discarded_because_too_short"]:
1211
+ trial["dffix_no_clean"].iloc[-1, trial["dffix_no_clean"].columns.get_loc("was_merged")] = False
1212
+ trial["dffix_no_clean"].iloc[-1, trial["dffix_no_clean"].columns.get_loc("was_discarded_short_fix")] = True
1213
+ trial["Fixation Cleaning Stats"]["Number of merged fixations"] = (
1214
+ num_fix_before_merge - dffix_after_merge.shape[0]
1215
+ )
1216
+ trial["Fixation Cleaning Stats"]["Number of merged fixations (%)"] = round(
1217
+ 100 * (trial["Fixation Cleaning Stats"]["Number of merged fixations"] / num_fix_before_merge), 2
1218
+ )
1219
+
1220
+ if not dffix.empty:
1221
+ dffix.reset_index(drop=True, inplace=True)
1222
+ dffix.loc[:, "fixation_number"] = np.arange(dffix.shape[0])
1223
+ trial["x_thres_in_chars"], trial["y_thresh_in_heights"] = x_thres_in_chars, y_thresh_in_heights
1224
+ if "chars_list" in trial and not dffix.empty:
1225
+ indeces_out = get_outlier_indeces(
1226
+ dffix,
1227
+ pd.DataFrame(trial["chars_list"]),
1228
+ x_thres_in_chars,
1229
+ y_thresh_in_heights,
1230
+ "x",
1231
+ "y",
1232
+ trial["letter_width_avg"],
1233
+ np.mean(trial["line_heights"]),
1234
+ )
1235
+ else:
1236
+ indeces_out = []
1237
+ dffix["is_far_out_of_text_uncorrected"] = "in"
1238
+ if len(indeces_out) > 0:
1239
+ times_out = dffix.loc[indeces_out, "start_time"].copy()
1240
+ dffix.loc[indeces_out, "is_far_out_of_text_uncorrected"] = "out"
1241
+ trial["Fixation Cleaning Stats"]["Far out of text fixations were discarded"] = discard_far_out_of_text_fix
1242
+ if discard_far_out_of_text_fix and len(indeces_out) > 0:
1243
+ num_fix_before_clean_via_discard_far_out_of_text_fix = dffix.shape[0]
1244
+ trial["dffix_no_clean"].loc[
1245
+ trial["dffix_no_clean"]["start_time"].isin(times_out), "was_discarded_outside_text"
1246
+ ] = True
1247
+ dffix = dffix.loc[dffix["is_far_out_of_text_uncorrected"] == "in", :].reset_index(drop=True).copy()
1248
+ trial["Fixation Cleaning Stats"]["Number of discarded far-out-of-text fixations"] = (
1249
+ num_fix_before_clean_via_discard_far_out_of_text_fix - dffix.shape[0]
1250
+ )
1251
+ trial["Fixation Cleaning Stats"]["Number of discarded far-out-of-text fixations (%)"] = round(
1252
+ 100
1253
+ * (
1254
+ trial["Fixation Cleaning Stats"]["Number of discarded far-out-of-text fixations"]
1255
+ / num_fix_before_clean_via_discard_far_out_of_text_fix
1256
+ ),
1257
+ 2,
1258
+ )
1259
+ dffix = dffix.drop(columns="is_far_out_of_text_uncorrected")
1260
+ if (
1261
+ choice_handle_short_and_close_fix == "Discard"
1262
+ or choice_handle_short_and_close_fix == "Merge then discard"
1263
+ and not dffix.empty
1264
+ ):
1265
+ num_fix_before_clean_via_discard_short = dffix.shape[0]
1266
+ times_out = dffix.loc[(dffix["duration"] < short_fix_threshold), "start_time"].copy()
1267
+ if len(times_out) > 0:
1268
+ trial["dffix_no_clean"].loc[
1269
+ trial["dffix_no_clean"]["start_time"].isin(times_out), "was_discarded_short_fix"
1270
+ ] = True
1271
+ dffix = dffix[(dffix["duration"] >= short_fix_threshold)].reset_index(drop=True).copy()
1272
+ trial["Fixation Cleaning Stats"]["Number of discarded short fixations"] = (
1273
+ num_fix_before_clean_via_discard_short - dffix.shape[0]
1274
+ )
1275
+ trial["Fixation Cleaning Stats"]["Number of discarded short fixations (%)"] = round(
1276
+ 100
1277
+ * (trial["Fixation Cleaning Stats"]["Number of discarded short fixations"])
1278
+ / num_fix_before_clean_via_discard_short,
1279
+ 2,
1280
+ )
1281
+
1282
+ trial["Fixation Cleaning Stats"]["Total number of discarded and merged fixations"] = (
1283
+ num_fix_before_clean - dffix.shape[0]
1284
+ )
1285
+ trial["Fixation Cleaning Stats"]["Total number of discarded and merged fixations (%)"] = round(
1286
+ 100 * trial["Fixation Cleaning Stats"]["Total number of discarded and merged fixations"] / num_fix_before_clean,
1287
+ 2,
1288
+ )
1289
+
1290
+ if not dffix.empty:
1291
+ droplist = ["num", "msg"]
1292
+ if discard_blinks:
1293
+ droplist += ["blink", "blink_before", "blink_after"]
1294
+ for col in droplist:
1295
+ if col in dffix.columns:
1296
+ dffix = dffix.drop(col, axis=1)
1297
+
1298
+ if "start" in dffix.columns:
1299
+ dffix = dffix.drop(axis=1, labels=["start", "stop"])
1300
+ if "corrected_start_time" not in dffix.columns:
1301
+ min_start_time = min(dffix["start_uncorrected"])
1302
+ dffix["corrected_start_time"] = dffix["start_uncorrected"] - min_start_time
1303
+ dffix["corrected_end_time"] = dffix["stop_uncorrected"] - min_start_time
1304
+ assert all(np.diff(dffix["corrected_start_time"]) > 0), "start times not in order"
1305
+
1306
+ dffix_no_clean_fig, _, _ = matplotlib_plot_df(
1307
+ dffix,
1308
+ trial,
1309
+ None,
1310
+ trial["dffix_no_clean"],
1311
+ box_annotations=None,
1312
+ fix_to_plot=["Uncorrected Fixations"],
1313
+ stim_info_to_plot=["Characters", "Word boxes"],
1314
+ )
1315
+ savename = f"{trial['subject']}_{trial['trial_id']}_clean_compare.png"
1316
+ dffix_no_clean_fig.savefig(RESULTS_FOLDER.joinpath(savename), dpi=300, bbox_inches="tight")
1317
+ plt.close(dffix_no_clean_fig)
1318
+
1319
+ dffix_clean_fig, _, _ = matplotlib_plot_df(
1320
+ dffix,
1321
+ trial,
1322
+ None,
1323
+ None,
1324
+ box_annotations=None,
1325
+ fix_to_plot=["Uncorrected Fixations"],
1326
+ stim_info_to_plot=["Characters", "Word boxes"],
1327
+ use_duration_arrow_sizes=False,
1328
+ )
1329
+ savename = f"{trial['subject']}_{trial['trial_id']}_after_clean.png"
1330
+ dffix_clean_fig.savefig(RESULTS_FOLDER.joinpath(savename), dpi=300, bbox_inches="tight")
1331
+ plt.close(dffix_clean_fig)
1332
+ if "item" not in dffix.columns and "item" in trial:
1333
+ dffix.insert(loc=0, column="item", value=trial["item"])
1334
+ if "condition" not in dffix.columns and "condition" in trial:
1335
+ dffix.insert(loc=0, column="condition", value=trial["condition"])
1336
+ if "subject" not in dffix.columns and "subject" in trial:
1337
+ dffix.insert(loc=0, column="subject", value=trial["subject"])
1338
+ if "trial_id" not in dffix.columns and "trial_id" in trial:
1339
+ dffix.insert(loc=0, column="trial_id", value=trial["trial_id"])
1340
+ dffix = reorder_columns(dffix)
1341
+ return dffix, trial
1342
+
1343
+
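+ # clean_dffix_own (above) keeps an untouched copy of the fixations in trial["dffix_no_clean"],
+ # flags every merged or discarded fixation there (was_merged / was_discarded_* columns),
+ # stores per-step counts in trial["Fixation Cleaning Stats"], and returns only the
+ # fixations that survived cleaning, alongside diagnostic plots saved to the results folder.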
1344
+ def add_time_cols(dffix, stop_time_col, start_time_col):
1345
+ if "start_time" not in dffix.columns:
1346
+ dffix["start_time"] = dffix[start_time_col]
1347
+ if "end_time" not in dffix.columns:
1348
+ dffix["end_time"] = dffix[stop_time_col]
1349
+ if "duration" not in dffix.columns:
1350
+ dffix["duration"] = dffix["end_time"] - dffix["start_time"]
1351
+
1352
+
1353
+ def get_time_cols(dffix):
1354
+ if "stop" in dffix.columns:
1355
+ stop_time_col = "stop"
1356
+ elif "end_time" in dffix.columns:
1357
+ stop_time_col = "end_time"
1358
+ elif "corrected_end_time" in dffix.columns:
1359
+ stop_time_col = "corrected_end_time"
1360
+ if "start" in dffix.columns:
1361
+ start_time_col = "start"
1362
+ elif "start_time" in dffix.columns:
1363
+ start_time_col = "start_time"
1364
+ elif "corrected_start_time" in dffix.columns:
1365
+ start_time_col = "corrected_start_time"
1366
+ return stop_time_col, start_time_col
1367
+
1368
+
1369
+ def trial_to_dfs(
1370
+ trial: dict,
1371
+ discard_fixations_without_sfix,
1372
+ choice_handle_short_and_close_fix,
1373
+ discard_far_out_of_text_fix,
1374
+ x_thres_in_chars,
1375
+ y_thresh_in_heights,
1376
+ short_fix_threshold,
1377
+ merge_distance_threshold,
1378
+ discard_long_fix,
1379
+ discard_long_fix_threshold,
1380
+ discard_blinks,
1381
+ ):
1382
+ events_df, trial = get_raw_events_df_and_trial(trial, discard_fixations_without_sfix)
1383
+ dffix, trial = clean_dffix_own(
1384
+ trial,
1385
+ choice_handle_short_and_close_fix,
1386
+ discard_far_out_of_text_fix,
1387
+ x_thres_in_chars,
1388
+ y_thresh_in_heights,
1389
+ short_fix_threshold,
1390
+ merge_distance_threshold,
1391
+ discard_long_fix,
1392
+ discard_long_fix_threshold,
1393
+ discard_blinks,
1394
+ events_df[events_df["msg"] == "FIX"].copy(),
1395
+ )
1396
+
1397
+ dffix = dffix.dropna(how="all", axis=1).copy()
1398
+ trial["dffix"] = dffix
1399
+ trial["events_df"] = events_df
1400
+ return dffix, trial
1401
+
1402
+
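+ # trial_to_dfs ties the parsing and cleaning steps together. Example call with
+ # illustrative settings (all thresholds are assumptions, not prescribed defaults):
+ # dffix, trial = trial_to_dfs(
+ #     trial, discard_fixations_without_sfix=True,
+ #     choice_handle_short_and_close_fix="Merge", discard_far_out_of_text_fix=False,
+ #     x_thres_in_chars=10, y_thresh_in_heights=1, short_fix_threshold=80,
+ #     merge_distance_threshold=1, discard_long_fix=True,
+ #     discard_long_fix_threshold=800, discard_blinks=True,
+ # )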
1403
+ def get_all_measures(
1404
+ trial,
1405
+ dffix,
1406
+ prefix,
1407
+ use_corrected_fixations=True,
1408
+ correction_algo="Wisdom_of_Crowds",
1409
+ measures_to_calculate=["initial_landing_position"],
1410
+ include_coords=False,
1411
+ save_to_csv=False,
1412
+ ):
1413
+ stim_df = pd.DataFrame(trial[f"{prefix}s_list"])
1414
+ if f"{prefix}_number" not in stim_df.columns:
1415
+ stim_df[f"{prefix}_number"] = np.arange(stim_df.shape[0])
1416
+ if use_corrected_fixations:
1417
+ dffix_copy = copy.deepcopy(dffix)
1418
+ dffix_copy["y"] = dffix_copy[f"y_{correction_algo}"]
1419
+ else:
1420
+ dffix_copy = dffix
1421
+ correction_algo = "uncorrected"
1422
+ res_dfs = []
1423
+ for measure in measures_to_calculate:
1424
+ if hasattr(anf, f"{measure}_own"):
1425
+ function = getattr(anf, f"{measure}_own")
1426
+ result = function(trial, dffix_copy, prefix, correction_algo)
1427
+ res_dfs.append(result)
1428
+ dfs_list = [df for df in [stim_df] + res_dfs if not df.empty]
1429
+ own_measure_df = stim_df
1430
+ if len(dfs_list) > 1:
1431
+ for df in dfs_list[1:]:
1432
+ droplist = [col for col in df.columns if (col != f"{prefix}_number" and col in stim_df.columns)]
1433
+ own_measure_df = own_measure_df.merge(df.drop(columns=droplist), how="left", on=[f"{prefix}_number"])
1434
+ first_column = own_measure_df.pop(prefix)
1435
+ own_measure_df.insert(0, prefix, first_column)
1436
+ wordfirst = pf.aggregate_words_firstrun(dffix_copy, correction_algo, measures_to_calculate)
1437
+ wordtmp = pf.aggregate_words(dffix_copy, pd.DataFrame(trial["words_list"]), correction_algo, measures_to_calculate)
1438
+ out = pf.combine_words(
1439
+ dffix_copy,
1440
+ wordfirst=wordfirst,
1441
+ wordtmp=wordtmp,
1442
+ algo_choice=correction_algo,
1443
+ measures_to_calculate=measures_to_calculate,
1444
+ )
1445
+
1446
+ extra_cols = list(set(out.columns) - set(own_measure_df.columns))
1447
+ cols_to_add = ["word_number"] + extra_cols
1448
+ own_measure_df = pd.merge(own_measure_df, out.loc[:, cols_to_add], on="word_number", how="left")
1449
+
1450
+ first_cols = [
1451
+ "subject",
1452
+ "trial_id",
1453
+ "item",
1454
+ "condition",
1455
+ "question_correct",
1456
+ "word_number",
1457
+ "word",
1458
+ ]
1459
+ for col in first_cols:
1460
+ if col in trial and col not in own_measure_df.columns:
1461
+ own_measure_df.insert(loc=0, column=col, value=trial[col])
1462
+
1463
+ own_measure_df = own_measure_df.dropna(how="all", axis=1).copy()
1464
+ if not include_coords:
1465
+ word_cols = ["word_xmin", "word_xmax", "word_ymax", "word_xmin", "word_ymin", "word_x_center", "word_y_center"]
1466
+ own_measure_df = own_measure_df.drop(columns=word_cols)
1467
+
1468
+ own_measure_df = reorder_columns(own_measure_df)
1469
+ if "question_correct" in own_measure_df.columns:
1470
+ own_measure_df = own_measure_df.drop(columns=["question_correct"])
1471
+ if save_to_csv:
1472
+ own_measure_df.to_csv(
1473
+ RESULTS_FOLDER / f"{trial['subject']}_{trial['trial_id']}_{correction_algo}_word_measures.csv"
1474
+ )
1475
+ return own_measure_df
1476
+
1477
+
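+ # Example get_all_measures call (illustrative): with use_corrected_fixations=True the
+ # fixation dataframe must already contain a y_<correction_algo> column; measure names
+ # are looked up as <measure>_own functions in analysis_funcs, with word-level popEye
+ # measures added via the aggregation helpers.
+ # word_measures = get_all_measures(
+ #     trial, dffix, prefix="word", correction_algo="Wisdom_of_Crowds",
+ #     measures_to_calculate=["initial_landing_position"], save_to_csv=False,
+ # )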
1478
+ def add_line_overlaps_to_sample(trial, sample):
1479
+ char_df = pd.DataFrame(trial["chars_list"])
1480
+ line_overlaps = []
1481
+ for arr in sample:
1482
+ y_val = arr[1]
1483
+ line_overlap = t.tensor(-1, dtype=t.float32)
1484
+ for idx, (x1, x2) in enumerate(zip(char_df.char_ymin.unique(), char_df.char_ymax.unique())):
1485
+ if x1 <= y_val <= x2:
1486
+ line_overlap = t.tensor(idx, dtype=t.float32)
1487
+ break
1488
+ line_overlaps.append(line_overlap)
1489
+ line_olaps_tensor = t.stack(line_overlaps, dim=0)
1490
+ sample = t.cat([sample, line_olaps_tensor.unsqueeze(1)], dim=1)
1491
+ return sample
1492
+
1493
+
1494
+ def norm_coords_by_letter_min_x_y(
1495
+ sample_idx: int,
1496
+ trialslist: list,
1497
+ samplelist: list,
1498
+ chars_center_coords_list: list = None,
1499
+ ):
1500
+ chars_df = pd.DataFrame(trialslist[sample_idx]["chars_list"])
1501
+ trialslist[sample_idx]["x_char_unique"] = list(chars_df.char_xmin.unique())
1502
+
1503
+ min_x_chars = chars_df.char_xmin.min()
1504
+ min_y_chars = chars_df.char_ymin.min()
1505
+
1506
+ norm_vector_substract = t.zeros(
1507
+ (1, samplelist[sample_idx].shape[1]), dtype=samplelist[sample_idx].dtype, device=samplelist[sample_idx].device
1508
+ )
1509
+ norm_vector_substract[0, 0] = norm_vector_substract[0, 0] + 1 * min_x_chars
1510
+ norm_vector_substract[0, 1] = norm_vector_substract[0, 1] + 1 * min_y_chars
1511
+
1512
+ samplelist[sample_idx] = samplelist[sample_idx] - norm_vector_substract
1513
+
1514
+ if chars_center_coords_list is not None:
1515
+ norm_vector_substract = norm_vector_substract.squeeze(0)[:2]
1516
+ if chars_center_coords_list[sample_idx].shape[-1] == norm_vector_substract.shape[-1] * 2:
1517
+ chars_center_coords_list[sample_idx][:, :2] -= norm_vector_substract
1518
+ chars_center_coords_list[sample_idx][:, 2:] -= norm_vector_substract
1519
+ else:
1520
+ chars_center_coords_list[sample_idx] -= norm_vector_substract
1521
+ return trialslist, samplelist, chars_center_coords_list
1522
+
1523
+
1524
+ def norm_coords_by_letter_positions(
1525
+ sample_idx: int,
1526
+ trialslist: list,
1527
+ samplelist: list,
1528
+ meanlist: list = None,
1529
+ stdlist: list = None,
1530
+ return_mean_std_lists=False,
1531
+ norm_by_char_averages=False,
1532
+ chars_center_coords_list: list = None,
1533
+ add_normalised_values_as_features=False,
1534
+ ):
1535
+ chars_df = pd.DataFrame(trialslist[sample_idx]["chars_list"])
1536
+ trialslist[sample_idx]["x_char_unique"] = list(chars_df.char_xmin.unique())
1537
+
1538
+ min_x_chars = chars_df.char_xmin.min()
1539
+ max_x_chars = chars_df.char_xmax.max()
1540
+
1541
+ norm_vector_multi = t.ones(
1542
+ (1, samplelist[sample_idx].shape[1]), dtype=samplelist[sample_idx].dtype, device=samplelist[sample_idx].device
1543
+ )
1544
+ if norm_by_char_averages:
1545
+ chars_list = trialslist[sample_idx]["chars_list"]
1546
+ char_widths = np.asarray([x["char_xmax"] - x["char_xmin"] for x in chars_list])
1547
+ char_heights = np.asarray([x["char_ymax"] - x["char_ymin"] for x in chars_list])
1548
+ char_widths_average = np.mean(char_widths[char_widths > 0])
1549
+ char_heights_average = np.mean(char_heights[char_heights > 0])
1550
+
1551
+ norm_vector_multi[0, 0] = norm_vector_multi[0, 0] * char_widths_average
1552
+ norm_vector_multi[0, 1] = norm_vector_multi[0, 1] * char_heights_average
1553
+
1554
+ else:
1555
+ line_height = min(np.unique(trialslist[sample_idx]["line_heights"]))
1556
+ line_width = max_x_chars - min_x_chars
1557
+ norm_vector_multi[0, 0] = norm_vector_multi[0, 0] * line_width
1558
+ norm_vector_multi[0, 1] = norm_vector_multi[0, 1] * line_height
1559
+ assert ~t.any(t.isnan(norm_vector_multi)), "Nan found in char norming vector"
1560
+
1561
+ norm_vector_multi = norm_vector_multi.squeeze(0)
1562
+ if add_normalised_values_as_features:
1563
+ norm_vector_multi = norm_vector_multi[norm_vector_multi != 1]
1564
+ normed_features = samplelist[sample_idx][:, : norm_vector_multi.shape[0]] / norm_vector_multi
1565
+ samplelist[sample_idx] = t.cat([samplelist[sample_idx], normed_features], dim=1)
1566
+ else:
1567
+ samplelist[sample_idx] = samplelist[sample_idx] / norm_vector_multi # in case time or pupil size is included
1568
+ if chars_center_coords_list is not None:
1569
+ norm_vector_multi = norm_vector_multi[:2]
1570
+ if chars_center_coords_list[sample_idx].shape[-1] == norm_vector_multi.shape[-1] * 2:
1571
+ chars_center_coords_list[sample_idx][:, :2] /= norm_vector_multi
1572
+ chars_center_coords_list[sample_idx][:, 2:] /= norm_vector_multi
1573
+ else:
1574
+ chars_center_coords_list[sample_idx] /= norm_vector_multi
1575
+ if return_mean_std_lists:
1576
+ mean_val = samplelist[sample_idx].mean(axis=0).cpu().numpy()
1577
+ meanlist.append(mean_val)
1578
+ std_val = samplelist[sample_idx].std(axis=0).cpu().numpy()
1579
+ stdlist.append(std_val)
1580
+ assert ~any(pd.isna(mean_val)), "Nan found in mean_val"
1581
+ assert ~any(pd.isna(mean_val)), "Nan found in std_val"
1582
+
1583
+ return trialslist, samplelist, meanlist, stdlist, chars_center_coords_list
1584
+ return trialslist, samplelist, chars_center_coords_list
1585
+
1586
+
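+ # The two norm_coords_* helpers above prepare model inputs: fixation coordinates are
+ # first shifted so the top-left character of the stimulus sits at the origin, then
+ # scaled either by average character width/height or by line width/height, so values
+ # are comparable across stimuli with different fonts and layouts.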
1587
+ def get_fig_ax(screen_res, dpi, words_df, x_margin, y_margin, dffix=None, prefix="word"):
1588
+ fig = plt.figure(figsize=(screen_res[0] / dpi, screen_res[1] / dpi), dpi=dpi)
1589
+ ax = plt.Axes(fig, [0.0, 0.0, 1.0, 1.0])
1590
+ ax.set_axis_off()
1591
+ if dffix is not None:
1592
+ ax.set_ylim((dffix.y.min(), dffix.y.max()))
1593
+ ax.set_xlim((dffix.x.min(), dffix.x.max()))
1594
+ else:
1595
+ ax.set_ylim((words_df[f"{prefix}_y_center"].min() - y_margin, words_df[f"{prefix}_y_center"].max() + y_margin))
1596
+ ax.set_xlim((words_df[f"{prefix}_x_center"].min() - x_margin, words_df[f"{prefix}_x_center"].max() + x_margin))
1597
+ ax.invert_yaxis()
1598
+ fig.add_axes(ax)
1599
+ return fig, ax
1600
+
1601
+
1602
+ def get_save_path(fpath, fname_ending):
1603
+ save_path = PLOTS_FOLDER.joinpath(f"{fpath.stem}_{fname_ending}.png")
1604
+ return save_path
1605
+
1606
+
1607
+ def save_im_load_convert(fpath, fig, fname_ending, mode):
1608
+ save_path = get_save_path(fpath, fname_ending)
1609
+ fig.savefig(save_path)
1610
+ im = Image.open(save_path).convert(mode)
1611
+ im.save(save_path)
1612
+ return im
1613
+
1614
+
1615
+ def plot_text_boxes_fixations(
1616
+ fpath,
1617
+ dpi,
1618
+ screen_res,
1619
+ set_font_size: bool,
1620
+ font_size: int,
1621
+ dffix=None,
1622
+ trial=None,
1623
+ ):
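+ # Renders three greyscale layers (character text, character boxes, fixation scatter) and stacks them into a single 3-channel image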
1624
+ if isinstance(fpath, str):
1625
+ fpath = pl.Path(fpath)
1626
+ prefix = "char"
1627
+
1628
+ if dffix is None:
1629
+ dffix = pd.read_csv(fpath)
1630
+ if trial is None:
1631
+ json_fpath = str(fpath).replace("_fixations.csv", "_trial.json")
1632
+ with open(json_fpath, "r") as f:
1633
+ trial = json.load(f)
1634
+ words_df = pd.DataFrame(trial[f"{prefix}s_list"])
1635
+ x_right = words_df[f"{prefix}_xmin"]
1636
+ x_left = words_df[f"{prefix}_xmax"]
1637
+ y_top = words_df[f"{prefix}_ymax"]
1638
+ y_bottom = words_df[f"{prefix}_ymin"]
1639
+
1640
+ if f"{prefix}_x_center" not in words_df.columns:
1641
+ words_df[f"{prefix}_x_center"] = (words_df[f"{prefix}_xmax"] - words_df[f"{prefix}_xmin"]) / 2 + words_df[
1642
+ f"{prefix}_xmin"
1643
+ ]
1644
+ words_df[f"{prefix}_y_center"] = (words_df[f"{prefix}_ymax"] - words_df[f"{prefix}_ymin"]) / 2 + words_df[
1645
+ f"{prefix}_ymin"
1646
+ ]
1647
+
1648
+ x_margin = words_df[f"{prefix}_x_center"].mean() / 8
1649
+ y_margin = words_df[f"{prefix}_y_center"].mean() / 4
1650
+ times = dffix.corrected_start_time - dffix.corrected_start_time.min()
1651
+ times = times / times.max()
1652
+ times = np.linspace(0.25, 1, len(times))
1653
+
1654
+ font = "monospace"
1655
+ if not set_font_size:
1656
+ # fall back to a font size derived from the trial's recorded font size
1657
+ font_size = trial["font_size"] * 27 // dpi
1658
+
1659
+ font_props = FontProperties(family=font, style="normal", size=font_size)
1660
+
1661
+ fig, ax = get_fig_ax(screen_res, dpi, words_df, x_margin, y_margin, prefix=prefix)
1662
+
1663
+ ax.scatter(words_df[f"{prefix}_x_center"], words_df[f"{prefix}_y_center"], s=1, facecolor="k", alpha=0.01)
1664
+ for idx in range(len(x_left)):
1665
+ ax.text(
1666
+ words_df[f"{prefix}_x_center"][idx],
1667
+ words_df[f"{prefix}_y_center"][idx],
1668
+ words_df[prefix][idx],
1669
+ horizontalalignment="center",
1670
+ verticalalignment="center",
1671
+ fontproperties=font_props,
1672
+ )
1673
+ fname_ending = f"{prefix}s_grey"
1674
+ words_grey_im = save_im_load_convert(fpath, fig, fname_ending, "L")
1675
+
1676
+ plt.close("all")
1677
+ fig, ax = get_fig_ax(screen_res, dpi, words_df, x_margin, y_margin, prefix=prefix)
1678
+
1679
+ ax.scatter(words_df[f"{prefix}_x_center"], words_df[f"{prefix}_y_center"], s=1, facecolor="k", alpha=0.1)
1680
+ for idx in range(len(x_left)):
1681
+ xdiff = x_right[idx] - x_left[idx]
1682
+ ydiff = y_top[idx] - y_bottom[idx]
1683
+ rect = patches.Rectangle(
1684
+ (x_left[idx] - 1, y_bottom[idx] - 1), xdiff, ydiff, alpha=0.9, linewidth=1, edgecolor="k", facecolor="grey"
1685
+ ) # seems to need one pixel offset
1686
+ ax.add_patch(rect)
1687
+ fname_ending = f"{prefix}_boxes_grey"
1688
+ word_boxes_grey_im = save_im_load_convert(fpath, fig, fname_ending, "L")
1689
+
1690
+ plt.close("all")
1691
+
1692
+ fig, ax = get_fig_ax(screen_res, dpi, words_df, x_margin, y_margin, prefix=prefix)
1693
+
1694
+ ax.scatter(dffix.x, dffix.y, facecolor="k", alpha=times)
1695
+ fname_ending = "fix_scatter_grey"
1696
+ fix_scatter_grey_im = save_im_load_convert(fpath, fig, fname_ending, "L")
1697
+
1698
+ plt.close("all")
1699
+
1700
+ arr_combo = np.stack(
1701
+ [
1702
+ np.asarray(words_grey_im),
1703
+ np.asarray(word_boxes_grey_im),
1704
+ np.asarray(fix_scatter_grey_im),
1705
+ ],
1706
+ axis=2,
1707
+ )
1708
+
1709
+ im_combo = Image.fromarray(arr_combo)
1710
+ fname_ending = f"{prefix}s_channel_sep"
1711
+
1712
+ im_combo.save(fpath)
1713
+
1714
+ return im_combo
1715
+
1716
+
1717
+ def prep_data_for_dist(model_cfg, dffix, trial):
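+ # Build the model input: optional coordinate normalisation, z-scoring with the stored training means/stds, and a single-trial DataLoader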
1718
+ if isinstance(dffix, dict):
1719
+ dffix = dffix["value"]
1720
+ sample_tensor = t.tensor(dffix.loc[:, model_cfg["sample_cols"]].to_numpy(), dtype=t.float32)
1721
+
1722
+ if model_cfg["add_line_overlap_feature"]:
1723
+ sample_tensor = add_line_overlaps_to_sample(trial, sample_tensor)
1724
+
1725
+ has_nans = t.any(t.isnan(sample_tensor))
1726
+ assert not has_nans, "NaNs found in sample tensor"
1727
+ samplelist_eval = [sample_tensor]
1728
+ trialslist_eval = [trial]
1729
+ chars_center_coords_list_eval = None
1730
+ if model_cfg["norm_coords_by_letter_min_x_y"]:
1731
+ for sample_idx, _ in enumerate(samplelist_eval):
1732
+ trialslist_eval, samplelist_eval, chars_center_coords_list_eval = norm_coords_by_letter_min_x_y(
1733
+ sample_idx,
1734
+ trialslist_eval,
1735
+ samplelist_eval,
1736
+ chars_center_coords_list=chars_center_coords_list_eval,
1737
+ )
1738
+
1739
+ if model_cfg["normalize_by_line_height_and_width"]:
1740
+ meanlist_eval, stdlist_eval = [], []
1741
+ for sample_idx, _ in enumerate(samplelist_eval):
1742
+ (
1743
+ trialslist_eval,
1744
+ samplelist_eval,
1745
+ meanlist_eval,
1746
+ stdlist_eval,
1747
+ chars_center_coords_list_eval,
1748
+ ) = norm_coords_by_letter_positions(
1749
+ sample_idx,
1750
+ trialslist_eval,
1751
+ samplelist_eval,
1752
+ meanlist_eval,
1753
+ stdlist_eval,
1754
+ return_mean_std_lists=True,
1755
+ norm_by_char_averages=model_cfg["norm_by_char_averages"],
1756
+ chars_center_coords_list=chars_center_coords_list_eval,
1757
+ add_normalised_values_as_features=model_cfg["add_normalised_values_as_features"],
1758
+ )
1759
+ sample_tensor = samplelist_eval[0]
1760
+ sample_means = t.tensor(model_cfg["sample_means"], dtype=t.float32)
1761
+ sample_std = t.tensor(model_cfg["sample_std"], dtype=t.float32)
1762
+ sample_tensor = (sample_tensor - sample_means) / sample_std
1763
+ sample_tensor = sample_tensor.unsqueeze(0)
1764
+ if not pl.Path(trial["plot_file"]).exists():
1765
+ plot_text_boxes_fixations(
1766
+ fpath=trial["plot_file"],
1767
+ dpi=250,
1768
+ screen_res=(1024, 768),
1769
+ set_font_size=True,
1770
+ font_size=4,
1771
+ dffix=dffix,
1772
+ trial=trial,
1773
+ )
1774
+
1775
+ val_set = DSet(
1776
+ sample_tensor,
1777
+ None,
1778
+ t.zeros((1, sample_tensor.shape[1])),
1779
+ trialslist_eval,
1780
+ padding_list=[0],
1781
+ padding_at_end=model_cfg["padding_at_end"],
1782
+ return_images_for_conv=True,
1783
+ im_partial_string=model_cfg["im_partial_string"],
1784
+ input_im_shape=model_cfg["char_plot_shape"],
1785
+ )
1786
+ val_loader = dl(val_set, batch_size=1, shuffle=False, num_workers=0)
1787
+ return val_loader, val_set
1788
+
1789
+
1790
+ def fold_in_seq_dim(out, y=None):
1791
+ batch_size, seq_len, num_classes = out.shape
1792
+
1793
+ out = eo.rearrange(out, "b s c -> (b s) c", s=seq_len)
1794
+ if y is None:
1795
+ return out, None
1796
+ if len(y.shape) > 2:
1797
+ y = eo.rearrange(y, "b s c -> (b s) c", s=seq_len)
1798
+ else:
1799
+ y = eo.rearrange(y, "b s -> (b s)", s=seq_len)
1800
+ return out, y
1801
+
1802
+
1803
+ def logits_to_pred(out, y=None):
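+ # Flatten the sequence dimension, decode the ordinal (CORN) logits into line indices, then reshape back to (batch, seq)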
1804
+ seq_len = out.shape[1]
1805
+ out, y = fold_in_seq_dim(out, y)
1806
+ preds = corn_label_from_logits(out)
1807
+ preds = eo.rearrange(preds, "(b s) -> b s", s=seq_len)
1808
+ if y is not None:
1809
+ y = eo.rearrange(y.squeeze(), "(b s) -> b s", s=seq_len)
1810
+ y = y
1811
+ return preds, y
1812
+
1813
+
1814
+ def get_DIST_preds(dffix, trial, models_dict):
1815
+ algo_choice = "DIST"
1816
+
1817
+ model = models_dict["single_DIST_model"]
1818
+ loader, dset = prep_data_for_dist(models_dict["single_DIST_model_cfg"], dffix, trial)
1819
+ batch = next(iter(loader))
1820
+
1821
+ if "cpu" not in str(model.device):
1822
+ batch = [x.cuda() for x in batch]
1823
+ try:
1824
+ out = model(batch)
1825
+ preds, y = logits_to_pred(out, y=None)
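+ # Clamp predicted line indices to the number of text lines and map them to the y-centres of the character lines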
1826
+ if len(trial["y_char_unique"]) < 1:
1827
+ y_char_unique = pd.DataFrame(trial["chars_list"]).char_y_center.sort_values().unique()
1828
+ else:
1829
+ y_char_unique = trial["y_char_unique"]
1830
+ num_lines = trial["num_char_lines"] - 1
1831
+ preds = t.clamp(preds, 0, num_lines).squeeze().cpu().numpy()
1832
+ y_pred_DIST = [y_char_unique[idx] for idx in preds]
1833
+
1834
+ dffix[f"line_num_{algo_choice}"] = preds
1835
+ dffix[f"y_{algo_choice}"] = np.round(y_pred_DIST, decimals=2)
1836
+ dffix[f"y_{algo_choice}_correction"] = (dffix.loc[:, f"y_{algo_choice}"] - dffix.loc[:, "y"]).round(2)
1837
+ except Exception as e:
1838
+ ic(f"Exception on model(batch) for DIST \n{e}")
1839
+ return dffix
1840
+
1841
+
1842
+ def get_DIST_ensemble_preds(
1843
+ dffix,
1844
+ trial,
1845
+ model_cfg_without_norm_df,
1846
+ model_cfg_with_norm_df,
1847
+ ensemble_model_avg,
1848
+ ):
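+ # Average the outputs of the two DIST model variants (with and without line-height/width normalisation) before decoding line assignments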
1849
+ algo_choice = "DIST-Ensemble"
1850
+ loader_without_norm, dset_without_norm = prep_data_for_dist(model_cfg_without_norm_df, dffix, trial)
1851
+ loader_with_norm, dset_with_norm = prep_data_for_dist(model_cfg_with_norm_df, dffix, trial)
1852
+ batch_without_norm = next(iter(loader_without_norm))
1853
+ batch_with_norm = next(iter(loader_with_norm))
1854
+ out = ensemble_model_avg((batch_without_norm, batch_with_norm))
1855
+ preds, y = logits_to_pred(out[0]["out_avg"], y=None)
1856
+ if len(trial["y_char_unique"]) < 1:
1857
+ y_char_unique = pd.DataFrame(trial["chars_list"]).char_y_center.sort_values().unique()
1858
+ else:
1859
+ y_char_unique = trial["y_char_unique"]
1860
+ num_lines = trial["num_char_lines"] - 1
1861
+ preds = t.clamp(preds, 0, num_lines).squeeze().cpu().numpy()
1862
+ y_pred_DIST = [y_char_unique[idx] for idx in preds]
1863
+
1864
+ dffix[f"line_num_{algo_choice}"] = preds
1865
+ dffix[f"y_{algo_choice}"] = np.round(y_pred_DIST, decimals=1)
1866
+ dffix[f"y_{algo_choice}_correction"] = (dffix.loc[:, f"y_{algo_choice}"] - dffix.loc[:, "y"]).round(1)
1867
+ return dffix
1868
+
1869
+
1870
+ def get_EDIST_preds_with_model_check(dffix, trial, models_dict):
1871
+
1872
+ dffix = get_DIST_ensemble_preds(
1873
+ dffix,
1874
+ trial,
1875
+ models_dict["model_cfg_without_norm_df"],
1876
+ models_dict["model_cfg_with_norm_df"],
1877
+ models_dict["ensemble_model_avg"],
1878
+ )
1879
+ return dffix
1880
+
1881
+
1882
+ def get_all_classic_preds(dffix, trial, classic_algos_cfg):
1883
+ corrections = []
1884
+ for algo, classic_params in copy.deepcopy(classic_algos_cfg).items():
1885
+ dffix = calgo.apply_classic_algo(dffix, trial, algo, classic_params)
1886
+ corrections.append(np.asarray(dffix.loc[:, f"y_{algo}"]))
1887
+ return dffix, corrections
1888
+
1889
+
1890
+ def apply_woc(dffix, trial, corrections, algo_choice):
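+ # Combine the candidate corrections into a consensus ("wisdom of the crowd") y value per fixation and derive line numbers from it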
1891
+
1892
+ corrected_Y = calgo.wisdom_of_the_crowd(corrections)
1893
+ dffix.loc[:, f"y_{algo_choice}"] = corrected_Y
1894
+ dffix[f"y_{algo_choice}_correction"] = (dffix.loc[:, f"y_{algo_choice}"] - dffix.loc[:, "y"]).round(1)
1895
+ corrected_line_nums = [trial["y_char_unique"].index(y) for y in corrected_Y]
1896
+ dffix.loc[:, f"line_num_y_{algo_choice}"] = corrected_line_nums
1897
+ dffix.loc[:, f"line_num_{algo_choice}"] = corrected_line_nums
1898
+ return dffix
1899
+
1900
+
1901
+ def apply_correction_algo(dffix, algo_choice, trial, models_dict, classic_algos_cfg):
1902
+
1903
+ if algo_choice == "DIST":
1904
+ dffix = get_DIST_preds(dffix, trial, models_dict=models_dict)
1905
+
1906
+ elif algo_choice == "DIST-Ensemble":
1907
+ dffix = get_EDIST_preds_with_model_check(dffix, trial, models_dict=models_dict)
1908
+ elif algo_choice == "Wisdom_of_Crowds_with_DIST":
1909
+ dffix, corrections = get_all_classic_preds(dffix, trial, classic_algos_cfg)
1910
+ dffix = get_DIST_preds(dffix, trial, models_dict=models_dict)
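+ # Append the DIST prediction three times so it carries extra weight in the crowd vote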
1911
+ for _ in range(3):
1912
+ corrections.append(np.asarray(dffix.loc[:, "y_DIST"]))
1913
+ dffix = apply_woc(dffix, trial, corrections, algo_choice)
1914
+ elif algo_choice == "Wisdom_of_Crowds_with_DIST_Ensemble":
1915
+ dffix, corrections = get_all_classic_preds(dffix, trial, classic_algos_cfg)
1916
+ dffix = get_EDIST_preds_with_model_check(dffix, trial, models_dict=models_dict)
1917
+ for _ in range(3):
1918
+ corrections.append(np.asarray(dffix.loc[:, "y_DIST-Ensemble"]))
1919
+ dffix = apply_woc(dffix, trial, corrections, algo_choice)
1920
+ elif algo_choice == "Wisdom_of_Crowds":
1921
+ dffix, corrections = get_all_classic_preds(dffix, trial, classic_algos_cfg)
1922
+ dffix = apply_woc(dffix, trial, corrections, algo_choice)
1923
+
1924
+ else:
1925
+ algo_cfg = classic_algos_cfg[algo_choice]
1926
+ dffix = calgo.apply_classic_algo(dffix, trial, algo_choice, algo_cfg)
1927
+ dffix[f"y_{algo_choice}_correction"] = (dffix.loc[:, f"y_{algo_choice}"] - dffix.loc[:, "y"]).round(1)
1928
+ dffix = dffix.copy() # apparently helps with fragmentation
1929
+ return dffix
1930
+
1931
+
1932
+ def add_popEye_cols_to_dffix(dffix, algo_choice, chars_df, trial, xcol, cols_to_add: list):
1933
+ """
1934
+ Required for word or sentence measures:
1935
+ - letternum
1936
+ - letter
1937
+ - on_word_number
1938
+ - on_word
1939
+ - on_sentence
1940
+ - num_words_in_sentence
1941
+ - on_sentence_num
1942
+ - word_land
1943
+ - line_let
1944
+ - line_word
1945
+ - sac_in
1946
+ - sac_out
1947
+ - word_launch
1948
+ - word_refix
1949
+ - word_reg_in
1950
+ - word_reg_out
1951
+ - sentence_reg_in
1952
+ - word_firstskip
1953
+ - word_run
1954
+ - sentence_run
1955
+ - word_run_fix
1956
+ - word_cland
1957
+ Optional:
1958
+ - line_let_from_last_letter
1959
+ - sentence_word
1960
+ - line_let_previous
1961
+ - line_let_next
1962
+ - sentence_refix
1963
+ - word_reg_out_to
1964
+ - word_reg_in_from
1965
+ - sentence_reg_out
1966
+ - sentence_reg_in_from
1967
+ - sentence_reg_out_to
1968
+ - sentence_firstskip
1969
+ - word_runid
1970
+ - sentence_runid
1971
+ - word_fix
1972
+ - sentence_fix
1973
+ """
1974
+ if "angle_incoming" in cols_to_add:
1975
+ x_diff_incoming = dffix[xcol].values - dffix[xcol].shift(1).values
1976
+ y_diff_incoming = dffix["y"].values - dffix["y"].shift(1).values
1977
+ angle_incoming = np.arctan2(y_diff_incoming, x_diff_incoming) * (180 / np.pi)
1978
+ dffix["angle_incoming"] = angle_incoming
1979
+ if "angle_outgoing" in cols_to_add:
1980
+ x_diff_outgoing = dffix[xcol].shift(-1).values - dffix[xcol].values
1981
+ y_diff_outgoing = dffix["y"].shift(-1).values - dffix["y"].values
1982
+ angle_outgoing = np.arctan2(y_diff_outgoing, x_diff_outgoing) * (180 / np.pi)
1983
+ dffix["angle_outgoing"] = angle_outgoing
1984
+ dffix[f"line_change_{algo_choice}"] = np.concatenate(
1985
+ ([0], np.diff(dffix[f"line_num_{algo_choice}"])), axis=0
1986
+ ).astype(int)
1987
+
1988
+ for i in list(dffix.index):
1989
+ if dffix.loc[i, f"line_num_{algo_choice}"] > -1 and not pd.isna(dffix.loc[i, f"line_num_{algo_choice}"]):
1990
+ selected_stimmat = chars_df[
1991
+ chars_df["assigned_line"] == dffix.loc[i, f"line_num_{algo_choice}"]
1992
+ ].reset_index()
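+ # Map the fixation to the closest character (by horizontal distance to the character centre) on its assigned line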
1993
+ selected_stimmat.loc[:, "letword"] = selected_stimmat.groupby("in_word_number")["letternum"].rank()
1994
+ letters_on_line = selected_stimmat.shape[0]
1995
+ out = dffix.loc[i, xcol] - selected_stimmat["char_x_center"]
1996
+ min_idx = out.abs().idxmin()
1997
+ dffix.loc[i, f"letternum_{algo_choice}"] = selected_stimmat.loc[min_idx, "letternum"]
1998
+ dffix.loc[i, f"letter_{algo_choice}"] = selected_stimmat.loc[min_idx, "char"]
1999
+ dffix.loc[i, f"line_let_{algo_choice}"] = selected_stimmat.loc[min_idx, "letline"]
2000
+ if "line_let_from_last_letter" in cols_to_add:
2001
+ dffix.loc[i, f"line_let_from_last_letter_{algo_choice}"] = (
2002
+ letters_on_line - dffix.loc[i, f"line_let_{algo_choice}"]
2003
+ )
2004
+ word_min_idx = min_idx
2005
+ if (
2006
+ selected_stimmat.loc[min_idx, "char"] == " "
2007
+ and (min_idx - 1) in selected_stimmat.index
2008
+ and (min_idx + 1) in selected_stimmat.index
2009
+ ):
2010
+ dist_to_previous_letter = np.abs(
2011
+ dffix.loc[i, xcol] - selected_stimmat.loc[min_idx - 1, "char_x_center"]
2012
+ )
2013
+ dist_to_following_letter = np.abs(
2014
+ dffix.loc[i, xcol] - selected_stimmat.loc[min_idx + 1, "char_x_center"]
2015
+ )
2016
+ if dist_to_previous_letter < dist_to_following_letter:
2017
+ word_min_idx = min_idx - 1
2018
+ if not pd.isna(selected_stimmat.loc[min_idx, "in_word_number"]):
2019
+ dffix.loc[i, f"on_word_number_{algo_choice}"] = selected_stimmat.loc[word_min_idx, "in_word_number"]
2020
+ dffix.loc[i, f"on_word_{algo_choice}"] = selected_stimmat.loc[word_min_idx, "in_word"]
2021
+ dffix.loc[i, f"word_land_{algo_choice}"] = selected_stimmat.loc[
2022
+ word_min_idx, "num_letters_from_start_of_word"
2023
+ ]
2024
+ dffix.loc[i, f"line_word_{algo_choice}"] = selected_stimmat.loc[word_min_idx, "wordline"]
2025
+ if "sentence_word" in cols_to_add:
2026
+ dffix.loc[i, f"sentence_word_{algo_choice}"] = selected_stimmat.loc[word_min_idx, "wordsent"]
2027
+ dffix.loc[i, "num_words_in_sentence"] = len(selected_stimmat.loc[word_min_idx, "in_sentence"].split(" "))
2028
+ dffix.loc[i, f"on_sentence_num_{algo_choice}"] = selected_stimmat.loc[word_min_idx, "in_sentence_number"]
2029
+ dffix.loc[i, f"on_sentence_{algo_choice}"] = selected_stimmat.loc[word_min_idx, "in_sentence"]
2030
+ if "line_let_previous" in cols_to_add:
2031
+ dffix[f"line_let_previous_{algo_choice}"] = dffix[f"line_let_{algo_choice}"].shift(-1)
2032
+ if "line_let_next" in cols_to_add:
2033
+ dffix[f"line_let_next_{algo_choice}"] = dffix[f"line_let_{algo_choice}"].shift(1)
2034
+ dffix = pf.compute_saccade_length(dffix, chars_df, algo_choice)
2035
+ dffix = pf.compute_launch_distance(dffix, algo_choice)
2036
+ dffix = pf.compute_refixation(dffix, algo_choice)
2037
+ dffix = pf.compute_regression(dffix, algo_choice)
2038
+ dffix = pf.compute_firstskip(dffix, algo_choice)
2039
+ dffix = pf.compute_run(dffix, algo_choice)
2040
+ dffix = pf.compute_landing_position(dffix, algo_choice)
2041
+ dffix = dffix.loc[:, ~dffix.columns.duplicated()]
2042
+ return dffix
2043
+
2044
+
2045
+ def export_dataframe(df: pd.DataFrame, csv_name: str):
2046
+ if isinstance(df, dict):
2047
+ df = df["value"]
2048
+ df.to_csv(csv_name)
2049
+ return csv_name
2050
+
2051
+
2052
+ def _convert_to_json(obj):
2053
+ if isinstance(obj, (int, float, str, bool)):
2054
+ return obj
2055
+ elif isinstance(obj, dict):
2056
+ return {k: _convert_to_json(v) for k, v in obj.items()}
2057
+ elif isinstance(obj, list) or isinstance(obj, tuple):
2058
+ return [_convert_to_json(item) for item in obj]
2061
+ elif hasattr(obj, "to_dict"):
2062
+ return _convert_to_json(obj.to_dict())
2063
+ elif hasattr(obj, "tolist"):
2064
+ return _convert_to_json(obj.tolist())
2065
+ elif obj is None:
2066
+ return None
2067
+ else:
2068
+ raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
2069
+
2070
+
2071
+ def save_trial_to_json(trial, savename):
2072
+ filtered_trial = {}
2073
+ for key, value in trial.items():
2074
+ try:
2075
+ filtered_trial[key] = _convert_to_json(value)
2076
+ except TypeError as e:
2077
+ ic(f"Warning: Skipping non-serializable value for key '{key}' due to error: {e}")
2078
+
2079
+ with open(savename, "w", encoding="utf-8") as f:
2080
+ json.dump(filtered_trial, f, ensure_ascii=False, indent=4)
2081
+
2082
+
2083
+ def export_trial(trial: dict):
2084
+
2085
+ trial_id = trial["trial_id"]
2086
+ savename = RESULTS_FOLDER.joinpath(pl.Path(trial["filename"]).stem)
2087
+ trial_name = f"{savename}_{trial_id}_trial_info.json"
2088
+
2089
+ filtered_trial = copy.deepcopy(trial)
2090
+ _ = [filtered_trial.pop(k) for k in list(filtered_trial.keys()) if isinstance(filtered_trial[k], pd.DataFrame)]
2091
+ _ = [
2092
+ filtered_trial.pop(k)
2093
+ for k in list(filtered_trial.keys())
2094
+ if k
2095
+ in [
2096
+ "words_list",
2097
+ "chars_list",
2098
+ "chars_df_alt",
2099
+ "EMReading_fix",
2100
+ "chars_df",
2101
+ "dffix_sacdf_popEye",
2102
+ "fixdf_popEye",
2103
+ "sacdf_popEye",
2104
+ "saccade_df",
2105
+ "combined_df",
2106
+ "own_sentence_measures_dfs_for_algo",
2107
+ "own_word_measures_dfs_for_algo",
2108
+ ]
2109
+ ]
2110
+
2111
+ filtered_trial["line_heights"] = list(np.unique(filtered_trial["line_heights"]))
2112
+ save_trial_to_json(filtered_trial, trial_name)
2113
+ return trial_name
2114
+
2115
+
2116
+ def add_cols_from_trial(trial, df, cols=["item", "condition", "trial_id", "subject"]):
2117
+ for col in cols:
2118
+ if col not in df.columns:
2119
+ df.insert(loc=0, column=col, value=trial[col])
2120
+
2121
+
2122
+ def correct_df(
2123
+ dffix,
2124
+ algo_choice,
2125
+ trial,
2126
+ for_multi,
2127
+ is_outside_of_streamlit,
2128
+ classic_algos_cfg,
2129
+ models_dict,
2130
+ measures_to_calculate_multi_asc=[],
2131
+ include_coords_multi_asc=False,
2132
+ sent_measures_to_calc_multi=[],
2133
+ fix_cols_to_add=[],
2134
+ ):
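+ # Apply each requested line-assignment algorithm, save diagnostic plots, add popEye-style fixation columns and, for batch processing, per-word and per-sentence measures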
2135
+ if is_outside_of_streamlit:
2136
+ stqdm = tqdm
2137
+ else:
2138
+ from stqdm import stqdm
2139
+
2140
+ if isinstance(dffix, dict):
2141
+ dffix = dffix["value"]
2142
+ if "x" not in dffix.keys() or "x" not in dffix.keys():
2143
+ ic(f"x or y not in dffix")
2144
+ ic(dffix.columns)
2145
+ return dffix
2146
+
2147
+ if isinstance(algo_choice, list):
2148
+ algo_choices = algo_choice
2149
+ repeats = range(len(algo_choice))
2150
+ else:
2151
+ algo_choices = [algo_choice]
2152
+ repeats = range(1)
2153
+
2154
+ chars_df = pd.DataFrame(trial["chars_df"]) if "chars_df" in trial else pd.DataFrame(trial["chars_list"])
2155
+ if for_multi:
2156
+ own_word_measures_dfs_for_algo = []
2157
+ own_sentence_measures_dfs_for_algo = []
2158
+ trial["average_y_corrections"] = []
2159
+ for algoIdx in stqdm(repeats, desc="Applying line-assignment algorithms"):
2160
+ algo_choice = algo_choices[algoIdx]
2161
+ dffix = apply_correction_algo(dffix, algo_choice, trial, models_dict, classic_algos_cfg)
2162
+ average_y_correction = (dffix[f"y_{algo_choice}"] - dffix["y"]).mean().round(1)
2163
+ trial["average_y_corrections"].append({"Algorithm": algo_choice, "average_y_correction": average_y_correction})
2164
+ fig, desired_width_in_pixels, desired_height_in_pixels = matplotlib_plot_df(
2165
+ dffix,
2166
+ trial,
2167
+ algo_choice,
2168
+ None,
2169
+ box_annotations=None,
2170
+ fix_to_plot=["Uncorrected Fixations", "Corrected Fixations"],
2171
+ stim_info_to_plot=["Characters", "Word boxes"],
2172
+ )
2173
+ savename = f"{trial['subject']}_{trial['trial_id']}_corr_{algo_choice}_fix.png"
2174
+ fig.savefig(RESULTS_FOLDER.joinpath(savename), dpi=300)
2175
+ plt.close(fig)
2176
+ dffix = add_popEye_cols_to_dffix(dffix, algo_choice, chars_df, trial, "x", cols_to_add=fix_cols_to_add)
2177
+
2178
+ if for_multi and len(measures_to_calculate_multi_asc) > 0 and dffix.shape[0] > 1:
2179
+ own_word_measures = get_all_measures(
2180
+ trial,
2181
+ dffix,
2182
+ prefix="word",
2183
+ use_corrected_fixations=True,
2184
+ correction_algo=algo_choice,
2185
+ measures_to_calculate=measures_to_calculate_multi_asc,
2186
+ include_coords=include_coords_multi_asc,
2187
+ )
2188
+ own_word_measures_dfs_for_algo.append(own_word_measures)
2189
+ sent_measures_multi = pf.compute_sentence_measures(
2190
+ dffix, pd.DataFrame(trial["chars_df"]), algo_choice, sent_measures_to_calc_multi
2191
+ )
2192
+ own_sentence_measures_dfs_for_algo.append(sent_measures_multi)
2193
+
2194
+ if for_multi and len(own_word_measures_dfs_for_algo) > 0:
2195
+ words_df = (
2196
+ pd.DataFrame(trial["chars_df"])
2197
+ .drop_duplicates(subset="in_word_number", keep="first")
2198
+ .loc[:, ["in_word_number", "in_word"]]
2199
+ .rename({"in_word_number": "word_number", "in_word": "word"}, axis=1)
2200
+ .reset_index(drop=True)
2201
+ )
2202
+ add_cols_from_trial(trial, words_df, cols=["item", "condition", "trial_id", "subject"])
2203
+ words_df["subject_trialID"] = [f"{id}_{num}" for id, num in zip(words_df["subject"], words_df["trial_id"])]
2204
+ words_df = words_df.merge(
2205
+ own_word_measures_dfs_for_algo[0],
2206
+ how="left",
2207
+ on=["subject", "trial_id", "item", "condition", "word_number", "word"],
2208
+ )
2209
+ for word_measure_df in own_word_measures_dfs_for_algo[1:]:
2210
+ words_df = words_df.merge(
2211
+ word_measure_df, how="left", on=["subject", "trial_id", "item", "condition", "word_number", "word"]
2212
+ )
2213
+ words_df = reorder_columns(words_df, ["subject", "trial_id", "item", "condition", "word_number", "word"])
2214
+
2215
+ sentence_df = (
2216
+ pd.DataFrame(trial["chars_df"])
2217
+ .drop_duplicates(subset="in_sentence_number", keep="first")
2218
+ .loc[
2219
+ :,
2220
+ [
2221
+ "in_sentence_number",
2222
+ "in_sentence",
2223
+ ],
2224
+ ]
2225
+ .rename({"in_sentence_number": "sentence_number", "in_sentence": "sentence"}, axis=1)
2226
+ .reset_index(drop=True)
2227
+ )
2228
+ add_cols_from_trial(trial, sentence_df, cols=["item", "condition", "trial_id", "subject"])
2229
+ sentence_df["subject_trialID"] = [
2230
+ f"{id}_{num}" for id, num in zip(sentence_df["subject"], sentence_df["trial_id"])
2231
+ ]
2232
+ sentence_df = sentence_df.merge(
2233
+ own_sentence_measures_dfs_for_algo[0],
2234
+ how="left",
2235
+ on=["item", "condition", "trial_id", "subject", "sentence_number", "sentence"],
2236
+ )
2237
+ for sent_measure_df in own_sentence_measures_dfs_for_algo[1:]:
2238
+ sentence_df = sentence_df.merge(
2239
+ sent_measure_df,
2240
+ how="left",
2241
+ on=["subject", "trial_id", "item", "condition", "sentence_number", "sentence", "number_of_words"],
2242
+ )
2243
+ sentence_df = reorder_columns(
2244
+ sentence_df, ["subject", "trial_id", "item", "condition", "sentence_number", "sentence", "number_of_words"]
2245
+ )
2246
+
2247
+ trial["own_word_measures_dfs_for_algo"] = words_df
2248
+
2249
+ trial["own_sentence_measures_dfs_for_algo"] = sentence_df
2250
+ dffix = reorder_columns(dffix)
2251
+ if for_multi:
2252
+ return dffix
2253
+ else:
2254
+ fix_cols_to_keep = [
2255
+ c
2256
+ for c in dffix.columns
2257
+ if (
2258
+ (any([lname in c for lname in ALL_FIX_MEASURES]) and any([lname in c for lname in fix_cols_to_add]))
2259
+ or (not any([lname in c for lname in ALL_FIX_MEASURES]))
2260
+ )
2261
+ ]
2262
+
2263
+ savename = RESULTS_FOLDER.joinpath(pl.Path(trial["filename"]).stem)
2264
+ csv_name = f"{savename}_{trial['trial_id']}_corrected_fixations.csv"
2265
+ csv_name = export_dataframe(dffix.loc[:, fix_cols_to_keep].copy(), csv_name)
2266
+
2267
+ export_trial(trial)
2268
+ return dffix
2269
+
2270
+
2271
+ def process_trial_choice(
2272
+ trial: dict,
2273
+ algo_choice: str,
2274
+ choice_handle_short_and_close_fix,
2275
+ for_multi,
2276
+ discard_fixations_without_sfix,
2277
+ discard_far_out_of_text_fix,
2278
+ x_thres_in_chars,
2279
+ y_thresh_in_heights,
2280
+ short_fix_threshold,
2281
+ merge_distance_threshold,
2282
+ discard_long_fix,
2283
+ discard_long_fix_threshold,
2284
+ discard_blinks,
2285
+ measures_to_calculate_multi_asc,
2286
+ include_coords_multi_asc,
2287
+ sent_measures_to_calculate_multi_asc,
2288
+ classic_algos_cfg,
2289
+ models_dict,
2290
+ fix_cols_to_add,
2291
+ ):
2292
+
2293
+ dffix, trial = trial_to_dfs(
2294
+ trial=trial,
2295
+ choice_handle_short_and_close_fix=choice_handle_short_and_close_fix,
2296
+ discard_fixations_without_sfix=discard_fixations_without_sfix,
2297
+ discard_far_out_of_text_fix=discard_far_out_of_text_fix,
2298
+ x_thres_in_chars=x_thres_in_chars,
2299
+ y_thresh_in_heights=y_thresh_in_heights,
2300
+ short_fix_threshold=short_fix_threshold,
2301
+ discard_long_fix=discard_long_fix,
2302
+ discard_long_fix_threshold=discard_long_fix_threshold,
2303
+ merge_distance_threshold=merge_distance_threshold,
2304
+ discard_blinks=discard_blinks,
2305
+ )
2306
+ if "chars_list" in trial:
2307
+ chars_df = pd.DataFrame(trial["chars_df"])
2308
+
2309
+ trial["chars_df"] = chars_df.to_dict()
2310
+ trial["y_char_unique"] = list(chars_df.char_y_center.sort_values().unique())
2311
+ if algo_choice is not None and ("chars_list" in trial or "words_list" in trial):
2312
+ if dffix.shape[0] > 1:
2313
+ dffix = correct_df(
2314
+ dffix,
2315
+ algo_choice,
2316
+ trial,
2317
+ for_multi=for_multi,
2318
+ is_outside_of_streamlit=False,
2319
+ classic_algos_cfg=classic_algos_cfg,
2320
+ models_dict=models_dict,
2321
+ measures_to_calculate_multi_asc=measures_to_calculate_multi_asc,
2322
+ include_coords_multi_asc=include_coords_multi_asc,
2323
+ sent_measures_to_calc_multi=sent_measures_to_calculate_multi_asc,
2324
+ fix_cols_to_add=fix_cols_to_add,
2325
+ )
2326
+
2327
+ saccade_df = get_saccade_df(dffix, trial, algo_choice, trial.pop("events_df"))
2328
+ trial["saccade_df"] = saccade_df.to_dict()
2329
+
2330
+ fig = plot_saccade_df(dffix, saccade_df, trial, True, False)
2331
+ fig.savefig(RESULTS_FOLDER / f"{trial['subject']}_{trial['trial_id']}_saccades.png")
2332
+ plt.close(fig)
2333
+ else:
2334
+ ic(
2335
+ f"🚨 Only {dffix.shape[0]} fixation left after processing. saccade_df not created for trial {trial['trial_id']} 🚨"
2336
+ )
2337
+
2338
+ else:
2339
+ ic("🚨 Stimulus information needed for fixation line-assignment 🚨")
2340
+ for c in ["gaze_df", "dffix"]:
2341
+ if c in trial:
2342
+ trial.pop(c)
2343
+ return dffix, trial
2344
+
2345
+
2346
+ def get_saccade_df(dffix, trial, algo_choices, events_df):
2347
+ if not isinstance(algo_choices, list):
2348
+ algo_choices = [algo_choices]
2349
+ sac_df_as_detected = events_df[events_df["msg"] == "SAC"].copy()
2350
+ last_sacc_stop_time = sac_df_as_detected["stop_uncorrected"].iloc[-1]
2351
+ dffix_after_last_sacc = dffix.loc[dffix["start_uncorrected"] > last_sacc_stop_time, :].copy()
2352
+ if not dffix_after_last_sacc.empty:
2353
+ dffix_before_last_sacc = dffix.loc[dffix["start_uncorrected"] < last_sacc_stop_time, :].copy()
2354
+ dffix = pd.concat([dffix_before_last_sacc, dffix_after_last_sacc.iloc[[0], :]], axis=0)
2355
+ sac_df_as_detected = sac_df_as_detected[sac_df_as_detected["start"] >= dffix["end_time"].iloc[0]]
2356
+ sac_df_as_detected = sac_df_as_detected[sac_df_as_detected["stop"] <= dffix["start_time"].iloc[-1]]
2357
+
2358
+ sac_index_keep = [
2359
+ i for i, row in sac_df_as_detected.iterrows() if np.abs(row["start"] - dffix["start_time"].values).min() < 100
2360
+ ]
2361
+ sac_df_as_detected = sac_df_as_detected.loc[sac_index_keep, :]
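+ # Pair each saccade with the fixation ending at or before its start (ffill) and the fixation starting at or after its stop (bfill) so fixation attributes can be copied onto the saccade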
2362
+
2363
+ starts = pd.Series(dffix["start_time"].values, dffix["start_time"])
2364
+ ends = pd.Series(dffix["end_time"].values, dffix["end_time"])
2365
+ starts_reind = starts.reindex(sac_df_as_detected["stop"], method="bfill").dropna()
2366
+ ends_reind = ends.reindex(sac_df_as_detected["start"], method="ffill").dropna()
2367
+
2368
+ sac_df_as_detected_start_indexed = sac_df_as_detected.copy().set_index("start")
2369
+ saccade_df = (
2370
+ sac_df_as_detected_start_indexed.loc[ends_reind.index, :]
2371
+ .reset_index(drop=False)
2372
+ .rename({"start": "start_time", "stop": "end_time"}, axis=1)
2373
+ )
2374
+
2375
+ saccade_df = pf.get_angle_and_eucl_dist(saccade_df)
2376
+ # TODO maybe add incoming outgoing angle from sacc_df to dffix
2377
+
2378
+ dffix_start_indexed = dffix.copy().set_index("start_time")
2379
+ dffix_end_indexed = dffix.copy().set_index("end_time")
2380
+ for algo_choice in algo_choices:
2381
+
2382
+ saccade_df[f"ys_{algo_choice}"] = dffix_end_indexed.loc[ends_reind.values, f"y_{algo_choice}"].values
2383
+ saccade_df[f"ye_{algo_choice}"] = dffix_start_indexed.loc[starts_reind.values, f"y_{algo_choice}"].values
2384
+ saccade_df = pf.get_angle_and_eucl_dist(saccade_df, algo_choice)
2385
+
2386
+ saccade_df[f"lines_{algo_choice}"] = dffix_end_indexed.loc[ends_reind.values, f"line_num_{algo_choice}"].values
2387
+ saccade_df[f"linee_{algo_choice}"] = dffix_start_indexed.loc[
2388
+ starts_reind.values, f"line_num_{algo_choice}"
2389
+ ].values
2390
+
2391
+ saccade_df[f"line_word_s_{algo_choice}"] = dffix_end_indexed.loc[
2392
+ ends_reind.values, f"line_word_{algo_choice}"
2393
+ ].values
2394
+ saccade_df[f"line_word_e_{algo_choice}"] = dffix_start_indexed.loc[
2395
+ starts_reind.values, f"line_word_{algo_choice}"
2396
+ ].values
2397
+
2398
+ saccade_df[f"lets_{algo_choice}"] = dffix_end_indexed.loc[ends_reind.values, f"letternum_{algo_choice}"].values
2399
+ saccade_df[f"lete_{algo_choice}"] = dffix_start_indexed.loc[
2400
+ starts_reind.values, f"letternum_{algo_choice}"
2401
+ ].values
2402
+
2403
+ blink_df = events_df[events_df["msg"] == "BLINK"]
2404
+ for i in range(len(saccade_df)):
2405
+ if saccade_df.loc[i, "start_time"] in blink_df["start"]:
2406
+ saccade_df.loc[i, "blink"] = True
2407
+
2408
+ saccade_df = pf.compute_non_line_dependent_saccade_measures(saccade_df, trial)
2409
+ for algo_choice in algo_choices:
2410
+ saccade_df = pf.compute_saccade_measures(saccade_df, trial, algo_choice)
2411
+
2412
+ if "msg" in saccade_df.columns:
2413
+ saccade_df = saccade_df.drop(axis=1, labels=["msg"])
2414
+ saccade_df = reorder_columns(saccade_df)
2415
+ return saccade_df.dropna(how="all", axis=1).copy()
popEye_funcs.py ADDED
@@ -0,0 +1,1373 @@
1
+ """
2
+ Mostly adapted from: https://github.com/sascha2schroeder/popEye
3
+ """
4
+
5
+ import numpy as np
6
+ import pandas as pd
7
+ from icecream import ic
8
+ from scipy import stats
9
+ import pathlib as pl
10
+
11
+ RESULTS_FOLDER = pl.Path("results")
12
+
13
+
14
+ def compute_velocity(xy):
15
+ samp = 1000
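+ # Smoothed five-sample central-difference velocity estimate; samp is the assumed sampling rate in Hz, with two-sample differences at the edges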
16
+
17
+ N = xy.shape[0]
18
+ v = pd.DataFrame(data=np.zeros((N, 3)), columns=["time", "vx", "vy"])
19
+ v["time"] = xy["time"]
20
+
21
+ v.iloc[2 : (N - 2), 1:3] = (
22
+ samp
23
+ / 6
24
+ * (
25
+ xy.iloc[4:N, 1:3].values
26
+ + xy.iloc[3 : (N - 1), 1:3].values
27
+ - xy.iloc[1 : (N - 3), 1:3].values
28
+ - xy.iloc[0 : (N - 4), 1:3].values
29
+ )
30
+ )
31
+ v.iloc[1, 1:3] = samp / 2 * (xy.iloc[2, 1:3].values - xy.iloc[0, 1:3].values)
32
+ v.iloc[(N - 2), 1:3] = samp / 2 * (xy.iloc[N - 1, 1:3].values - xy.iloc[N - 3, 1:3].values)
33
+
34
+ xy = pd.concat([xy.set_index("time"), v.set_index("time")], axis=1).reset_index()
35
+ return xy
36
+
37
+
38
+ def event_long(events_df):
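+ # Drop zero-duration events, merge each BLINK with the preceding event (which is removed), and flag neighbouring events via blink_before/blink_after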
39
+ events_df["duration"] = events_df["stop"] - events_df["start"]
40
+ events_df = events_df[events_df["duration"] > 0]
41
+ events_df = events_df.drop(columns=["duration"])
42
+ events_df.reset_index(drop=True, inplace=True)
43
+ tmplong_cols = list(events_df.columns)
44
+ tmplong_cols.remove("msg")
45
+ events_df["del"] = 0
46
+ for i in events_df.index:
47
+ if events_df.loc[i, "msg"] == "BLINK":
48
+ if i == 0:
49
+ continue
50
+ for col in tmplong_cols:
51
+ events_df.loc[i, col] = events_df.loc[i - 1, col]
52
+ events_df.loc[i - 1, "del"] = 1
53
+
54
+ events_df = events_df[events_df["del"] == 0]
55
+ events_df = events_df.drop(columns=["del"])
56
+ events_df.reset_index(drop=True, inplace=True)
57
+ events_df["num"] = range(len(events_df))
58
+ # compute blinks
59
+ # ---------------
60
+
61
+ events_df["blink_before"] = 0
62
+ events_df["blink_after"] = 0
63
+
64
+ for i in events_df.index:
65
+ if events_df.loc[i, "msg"] == "BLINK":
66
+ events_df.loc[i - 1, "blink_after"] = 1
67
+ if i < len(events_df) - 1:
68
+ events_df.loc[i + 1, "blink_before"] = 1
69
+
70
+ # combine
71
+ events_df["blink"] = (events_df["blink_before"] == 1) | (events_df["blink_after"] == 1)
72
+ return events_df.copy()
73
+
74
+
75
+ def compute_non_line_dependent_saccade_measures(saccade_df, trial_dict):
76
+
77
+ saccade_df["trial_id"] = trial_dict["trial_id"]
78
+ gaze_df = trial_dict["gaze_df"]
79
+ for s in range(len(saccade_df)):
80
+ is_directional_deviation = False
81
+ a = saccade_df["start_time"][s]
82
+ b = saccade_df["end_time"][s]
83
+
84
+ if not gaze_df["x"][[True if (a <= x <= b) else False for x in gaze_df["time"]]].any():
85
+ gaze_df.loc[a:b, "x"] = np.nan
86
+
87
+ bool_vec = (gaze_df["time"] >= a) & (gaze_df["time"] <= b)
88
+ if (not gaze_df["x"][bool_vec].isna().any()) and bool_vec.any():
89
+ # saccade amplitude (dX, dY)
90
+ minx = min(gaze_df.loc[bool_vec, "x"])
91
+ maxx = max(gaze_df.loc[bool_vec, "x"])
92
+ if "calibration_method" not in trial_dict or trial_dict["calibration_method"] != "H3":
93
+ miny = min(gaze_df.loc[bool_vec, "y"])
94
+ maxy = max(gaze_df.loc[bool_vec, "y"])
95
+ ix1 = gaze_df.loc[bool_vec, "x"].index[np.argmin(gaze_df.loc[bool_vec, "x"])]
96
+ ix2 = gaze_df.loc[bool_vec, "x"].index[np.argmax(gaze_df.loc[bool_vec, "x"])]
97
+ if "calibration_method" not in trial_dict or trial_dict["calibration_method"] != "H3":
98
+ iy1 = gaze_df.loc[bool_vec, "y"].index[np.argmin(gaze_df.loc[bool_vec, "y"])]
99
+ iy2 = gaze_df.loc[bool_vec, "y"].index[np.argmax(gaze_df.loc[bool_vec, "y"])]
100
+ saccade_df.loc[s, "dX"] = round(np.sign(ix2 - ix1) * (maxx - minx))
101
+ if "calibration_method" not in trial_dict or trial_dict["calibration_method"] != "H3":
102
+ saccade_df.loc[s, "dY"] = round(np.sign(iy2 - iy1) * (maxy - miny))
103
+
104
+ # saccade amplitude/angle
105
+ if "calibration_method" not in trial_dict or trial_dict["calibration_method"] != "H3":
106
+ saccade_df.loc[s, "amp_px"] = round(
107
+ np.sqrt(saccade_df.loc[s, "dX"] ** 2 + saccade_df.loc[s, "dY"] ** 2)
108
+ )
109
+ saccade_df.loc[s, "amp_angle"] = round(np.arctan2(saccade_df.loc[s, "dY"], saccade_df.loc[s, "dX"]), 2)
110
+ saccade_df.loc[s, "amp_angle_deg"] = round(
111
+ np.arctan2(saccade_df.loc[s, "dY"], saccade_df.loc[s, "dX"]) * (180 / np.pi), 2
112
+ )
113
+
114
+ else:
115
+ saccade_df.loc[s, "amp_px"] = np.nan
116
+ saccade_df.loc[s, "amp_angle"] = np.nan
117
+ saccade_df.loc[s, "amp_angle_deg"] = np.nan
118
+
119
+ if 35 <= abs(saccade_df.loc[s, "angle"]) <= 145:
120
+ if saccade_df.loc[s, "xe"] - saccade_df.loc[s, "xs"] > 0 and not (
121
+ "blink_before" in saccade_df.columns
122
+ and (saccade_df.loc[s, "blink_before"] or saccade_df.loc[s, "blink_after"])
123
+ ):
124
+ is_directional_deviation = True
125
+
126
+ saccade_df.loc[s, "is_directional_deviation"] = is_directional_deviation
127
+
128
+ return saccade_df
129
+
130
+
131
+ def compute_saccade_measures(saccade_df, trial_dict, algo_choice):
132
+
133
+ if algo_choice is not None:
134
+ algo_str = f"_{algo_choice}"
135
+ else:
136
+ algo_str = ""
137
+ gaze_df = trial_dict["gaze_df"]
138
+ saccade_df.reset_index(drop=True, inplace=True)
139
+ saccade_df.loc[:, f"has_line_change{algo_str}"] = (
140
+ saccade_df.loc[:, f"lines{algo_str}"] != saccade_df.loc[:, f"linee{algo_str}"]
141
+ )
142
+ saccade_df.loc[:, f"goes_to_next_line{algo_str}"] = saccade_df.loc[:, f"linee{algo_str}"] == (
143
+ saccade_df.loc[:, f"lines{algo_str}"] + 1
144
+ )
145
+ saccade_df.loc[:, f"is_directional_deviation{algo_str}"] = False
146
+ saccade_df.loc[:, f"is_return_sweep{algo_str}"] = False
147
+
148
+ for sidx, subdf in saccade_df.groupby(f"lines{algo_str}"):
149
+ if subdf.iloc[-1][f"goes_to_next_line{algo_str}"]:
150
+ saccade_df.loc[subdf.index[-1], f"is_return_sweep{algo_str}"] = True
151
+
152
+ for s in range(len(saccade_df)):
153
+ is_directional_deviation = False
154
+ a = saccade_df["start_time"][s]
155
+ b = saccade_df["end_time"][s]
156
+
157
+ if not gaze_df["x"][[True if (a <= x <= b) else False for x in gaze_df["time"]]].any():
158
+ gaze_df.loc[a:b, "x"] = np.nan
159
+
160
+ # saccade distance in letters
161
+ if saccade_df.loc[s, f"lete{algo_str}"] is None or saccade_df.loc[s, f"lets{algo_str}"] is None:
162
+ ic(
163
+ f"None found for compute_saccade_measures at index {s} for subj {trial_dict['subject']} and trial {trial_dict['trial_id']}"
164
+ )
165
+ else:
166
+ saccade_df.loc[s, f"dist_let{algo_str}"] = (
167
+ saccade_df.loc[s, f"lete{algo_str}"] - saccade_df.loc[s, f"lets{algo_str}"]
168
+ )
169
+
170
+ bool_vec = (gaze_df["time"] >= a) & (gaze_df["time"] <= b)
171
+ if (not gaze_df["x"][bool_vec].isna().any()) and bool_vec.any():
172
+ # saccade peak velocity (vpeak)
173
+ if "calibration_method" not in trial_dict or trial_dict["calibration_method"] != "H3":
174
+ vx = gaze_df.vx[bool_vec]
175
+ vy = gaze_df.vy[bool_vec]
176
+ if not vx.empty and not vy.empty:
177
+ saccade_df.loc[s, f"peak_vel{algo_str}"] = round(np.nanmax(np.sqrt(vx**2 + vy**2)))
178
+ else:
179
+ saccade_df.loc[s, f"peak_vel{algo_str}"] = round(np.nanmax(np.sqrt(gaze_df.vx[bool_vec] ** 2)))
180
+
181
+ if 35 <= abs(saccade_df.loc[s, f"angle{algo_str}"]) <= 145:
182
+ if saccade_df.loc[s, "xe"] - saccade_df.loc[s, "xs"] > 0 and not (
183
+ "blink_before" in saccade_df.columns
184
+ and (saccade_df.loc[s, "blink_before"] or saccade_df.loc[s, "blink_after"])
185
+ ):
186
+ is_directional_deviation = True
187
+
188
+ saccade_df.loc[s, f"is_directional_deviation{algo_str}"] = is_directional_deviation
189
+ return saccade_df.copy()
190
+
191
+
192
+ def get_angle_and_eucl_dist(saccade_df, algo_choice=None):
193
+ if algo_choice is not None:
194
+ algo_str = f"_{algo_choice}"
195
+ else:
196
+ algo_str = ""
197
+ saccade_df["xe_minus_xs"] = saccade_df["xe"] - saccade_df["xs"]
198
+ saccade_df[f"ye_minus_ys{algo_str}"] = saccade_df[f"ye{algo_str}"] - saccade_df[f"ys{algo_str}"]
199
+ saccade_df["eucledian_distance"] = (
200
+ saccade_df["xe_minus_xs"].map(np.square) + saccade_df[f"ye_minus_ys{algo_str}"].map(np.square)
201
+ ).map(np.sqrt)
202
+ saccade_df[f"angle{algo_str}"] = np.arctan2(
203
+ saccade_df.loc[:, f"ye_minus_ys{algo_str}"], saccade_df.loc[:, "xe_minus_xs"]
204
+ ) * (180 / np.pi)
205
+ return saccade_df
206
+
207
+
208
+ def compute_saccade_length(dffix, stimulus_df, algo_choice):
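+ # Incoming (sac_in) and outgoing (sac_out) saccade lengths in letters, handling same-line, forward and backward line changes separately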
209
+
210
+ for j in dffix.index:
211
+ if (
212
+ j == 0
213
+ or pd.isna(dffix.at[j, f"line_num_{algo_choice}"])
214
+ or pd.isna(dffix.at[j - 1, f"line_num_{algo_choice}"])
215
+ or dffix.at[j, f"letternum_{algo_choice}"] is None
216
+ or dffix.at[j - 1, f"letternum_{algo_choice}"] is None
217
+ ):
218
+ continue
219
+
220
+ # Same line, calculate saccade length as difference in letter numbers
221
+ if dffix.at[j - 1, f"line_num_{algo_choice}"] == dffix.at[j, f"line_num_{algo_choice}"]:
222
+ dffix.at[j, f"sac_in_{algo_choice}"] = (
223
+ dffix.at[j, f"letternum_{algo_choice}"] - dffix.at[j - 1, f"letternum_{algo_choice}"]
224
+ )
225
+
226
+ # Go to line ahead, calculate saccade length as difference in minimum letter numbers in target and previous lines, respectively
227
+ elif dffix.at[j - 1, f"line_num_{algo_choice}"] < dffix.at[j, f"line_num_{algo_choice}"]:
228
+ min_stim_j = np.min(
229
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j, f"line_num_{algo_choice}"]]["letternum"]
230
+ )
231
+ min_stim_j_1 = np.min(
232
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j - 1, f"line_num_{algo_choice}"]]["letternum"]
233
+ )
234
+ dffix.at[j, f"sac_in_{algo_choice}"] = (dffix.at[j, f"letternum_{algo_choice}"] - min_stim_j) - (
235
+ dffix.at[j - 1, f"letternum_{algo_choice}"] - min_stim_j_1
236
+ )
237
+
238
+ # Return to line visited before, calculate saccade length as difference in minimum letter numbers in target and next lines, respectively
239
+ elif dffix.at[j - 1, f"line_num_{algo_choice}"] > dffix.at[j, f"line_num_{algo_choice}"]:
240
+ min_stim_j_1 = np.min(
241
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j - 1, f"line_num_{algo_choice}"]]["letternum"]
242
+ )
243
+ min_stim_j = np.min(
244
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j, f"line_num_{algo_choice}"]]["letternum"]
245
+ )
246
+ dffix.at[j, f"sac_in_{algo_choice}"] = (dffix.at[j - 1, f"letternum_{algo_choice}"] - min_stim_j_1) - (
247
+ dffix.at[j, f"letternum_{algo_choice}"] - min_stim_j
248
+ )
249
+
250
+ for j in range(len(dffix) - 1):
251
+ if (
252
+ pd.isna(dffix.at[j, f"line_num_{algo_choice}"])
253
+ or pd.isna(dffix.at[j + 1, f"line_num_{algo_choice}"])
254
+ or dffix.at[j + 1, f"letternum_{algo_choice}"] is None
255
+ or dffix.at[j, f"letternum_{algo_choice}"] is None
256
+ ):
257
+ continue
258
+
259
+ # Same line, calculate saccade length as difference in letter numbers
260
+ if dffix.at[j + 1, f"line_num_{algo_choice}"] == dffix.at[j, f"line_num_{algo_choice}"]:
261
+ dffix.at[j, f"sac_out_{algo_choice}"] = (
262
+ dffix.at[j + 1, f"letternum_{algo_choice}"] - dffix.at[j, f"letternum_{algo_choice}"]
263
+ )
264
+
265
+ elif dffix.at[j + 1, f"line_num_{algo_choice}"] > dffix.at[j, f"line_num_{algo_choice}"]:
266
+ min_stim_j_1 = np.min(
267
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j + 1, f"line_num_{algo_choice}"]]["letternum"]
268
+ )
269
+ min_stim_j = np.min(
270
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j, f"line_num_{algo_choice}"]]["letternum"]
271
+ )
272
+ dffix.at[j, f"sac_out_{algo_choice}"] = (dffix.at[j + 1, f"letternum_{algo_choice}"] - min_stim_j_1) - (
273
+ dffix.at[j, f"letternum_{algo_choice}"] - min_stim_j
274
+ )
275
+
276
+ elif dffix.at[j + 1, f"line_num_{algo_choice}"] < dffix.at[j, f"line_num_{algo_choice}"]:
277
+ min_stim_j_1 = np.min(
278
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j, f"line_num_{algo_choice}"]]["letternum"]
279
+ )
280
+ min_stim_j = np.min(
281
+ stimulus_df[stimulus_df["assigned_line"] == dffix.at[j + 1, f"line_num_{algo_choice}"]]["letternum"]
282
+ )
283
+ dffix.at[j, f"sac_out_{algo_choice}"] = (dffix.at[j, f"letternum_{algo_choice}"] - min_stim_j) - (
284
+ dffix.at[j + 1, f"letternum_{algo_choice}"] - min_stim_j_1
285
+ )
286
+
287
+ return dffix
288
+
289
+
290
+ def compute_launch_distance(dffix, algo_choice):
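+ # Launch-site distance in letters relative to the beginning of the fixated word, derived from the incoming saccade length and the landing position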
291
+
292
+ for i in range(1, dffix.shape[0]):
293
+ if pd.isna(dffix.loc[i, f"sac_in_{algo_choice}"]):
294
+ continue
295
+
296
+ if dffix.loc[i, f"sac_in_{algo_choice}"] >= 0:
297
+ dffix.loc[i, f"word_launch_{algo_choice}"] = (
298
+ dffix.loc[i, f"sac_in_{algo_choice}"] - dffix.loc[i, f"word_land_{algo_choice}"]
299
+ )
300
+
301
+ else:
302
+ dffix.loc[i, f"word_launch_{algo_choice}"] = (
303
+ dffix.loc[i, f"sac_in_{algo_choice}"] + dffix.loc[i - 1, f"word_land_{algo_choice}"]
304
+ )
305
+
306
+ return dffix
307
+
308
+
309
+ def compute_refixation(dffix, algo_choice):
310
+ dffix.loc[:, f"word_refix_{algo_choice}"] = False
311
+ dffix.loc[:, f"sentence_refix_{algo_choice}"] = False
312
+ for j in dffix.index:
313
+ if (
314
+ j == 0
315
+ or pd.isna(dffix.loc[j, f"on_word_number_{algo_choice}"])
316
+ or pd.isna(dffix.loc[j - 1, f"on_word_number_{algo_choice}"])
317
+ ):
318
+ continue
319
+ dffix.loc[j, f"word_refix_{algo_choice}"] = (
320
+ dffix.loc[j, f"on_word_number_{algo_choice}"] == dffix.loc[j - 1, f"on_word_number_{algo_choice}"]
321
+ )
322
+ dffix.loc[j, f"sentence_refix_{algo_choice}"] = (
323
+ dffix.loc[j, f"on_sentence_num_{algo_choice}"] == dffix.loc[j - 1, f"on_sentence_num_{algo_choice}"]
324
+ )
325
+ return dffix
326
+
327
+
328
+ def compute_regression(dffix, algo_choice):
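+ # A fixation counts as a regression-in when it lands on an earlier word/sentence than the previous fixation; the previous fixation is flagged as regression-out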
329
+ tmp = dffix.copy()
330
+ tmp.reset_index(drop=True, inplace=True)
331
+ tmp.loc[:, f"word_reg_out_{algo_choice}"] = False
332
+ tmp.loc[:, f"word_reg_in_{algo_choice}"] = False
333
+ tmp.loc[:, f"word_reg_out_to_{algo_choice}"] = float("nan")
334
+ tmp.loc[:, f"word_reg_in_from_{algo_choice}"] = float("nan")
335
+ tmp.loc[:, f"sentence_reg_out_{algo_choice}"] = False
336
+ tmp.loc[:, f"sentence_reg_in_{algo_choice}"] = False
337
+ tmp.loc[:, f"sentence_reg_out_to_{algo_choice}"] = float("nan")
338
+ tmp.loc[:, f"sentence_reg_in_from_{algo_choice}"] = float("nan")
339
+
340
+ if len(tmp) > 1:
341
+ for j in range(1, len(tmp)):
342
+ # Skip outliers
343
+ if pd.isnull(tmp.iloc[j][f"on_word_number_{algo_choice}"]) or pd.isnull(
344
+ tmp.iloc[j - 1][f"on_word_number_{algo_choice}"]
345
+ ):
346
+ continue
347
+
348
+ # Word
349
+ if tmp.iloc[j][f"on_word_number_{algo_choice}"] < tmp.iloc[j - 1][f"on_word_number_{algo_choice}"]:
350
+ tmp.loc[j, f"word_reg_in_{algo_choice}"] = True
351
+ tmp.loc[j - 1, f"word_reg_out_{algo_choice}"] = True
352
+ tmp.loc[j, f"word_reg_in_from_{algo_choice}"] = tmp.iloc[j - 1][f"on_word_number_{algo_choice}"]
353
+ tmp.loc[j - 1, f"word_reg_out_to_{algo_choice}"] = tmp.iloc[j][f"on_word_number_{algo_choice}"]
354
+
355
+ # Sentence
356
+ if tmp.iloc[j][f"on_sentence_num_{algo_choice}"] < tmp.iloc[j - 1][f"on_sentence_num_{algo_choice}"]:
357
+ tmp.loc[j, f"sentence_reg_in_{algo_choice}"] = True
358
+ tmp.loc[j - 1, f"sentence_reg_out_{algo_choice}"] = True
359
+ tmp.loc[j, f"sentence_reg_in_from_{algo_choice}"] = tmp.iloc[j - 1][f"on_sentence_num_{algo_choice}"]
360
+ tmp.loc[j - 1, f"sentence_reg_out_to_{algo_choice}"] = tmp.iloc[j][f"on_sentence_num_{algo_choice}"]
361
+
362
+ extra_cols = list(set(tmp.columns) - set(dffix.columns))
363
+ # select these columns from tmp and add the 'fixation_number'
364
+ cols_to_add = ["fixation_number"] + extra_cols
365
+
366
+ # merge selected columns to dffix with 'outer' how and 'fixation_number' as common key
367
+ dffix = pd.merge(dffix, tmp[cols_to_add], on="fixation_number", how="outer")
368
+ return dffix
369
+
370
+
371
+ def compute_firstskip(dffix, algo_choice):
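+ # A word/sentence counts as first-pass skipped if it receives its first fixation only after a later word/sentence has already been fixated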
372
+ dffix[f"word_firstskip_{algo_choice}"] = 0
373
+ word_mem = []
374
+
375
+ dffix[f"sentence_firstskip_{algo_choice}"] = 0
376
+ sentence_mem = []
377
+ dffix.reset_index(inplace=True)
378
+ for j in range(dffix.shape[0]):
379
+
380
+ # word
381
+ if (
382
+ dffix.loc[j, f"on_word_number_{algo_choice}"] < np.max(word_mem, initial=0)
383
+ and dffix.loc[j, f"on_word_number_{algo_choice}"] not in word_mem
384
+ ):
385
+ dffix.loc[j, f"word_firstskip_{algo_choice}"] = 1
386
+
387
+ # sent
388
+ if (
389
+ dffix.loc[j, f"on_sentence_num_{algo_choice}"] < np.max(sentence_mem, initial=0)
390
+ and dffix.loc[j, f"on_sentence_num_{algo_choice}"] not in sentence_mem
391
+ ):
392
+ dffix.loc[j, f"sentence_firstskip_{algo_choice}"] = 1
393
+
394
+ word_mem.append(dffix.loc[j, f"on_word_number_{algo_choice}"])
395
+ sentence_mem.append(dffix.loc[j, f"on_sentence_num_{algo_choice}"])
396
+
397
+ # set NA values for missing line numbers
398
+ dffix.loc[dffix[f"line_num_{algo_choice}"].isna(), f"word_firstskip_{algo_choice}"] = np.nan
399
+ dffix.loc[dffix[f"line_num_{algo_choice}"].isna(), f"sentence_firstskip_{algo_choice}"] = np.nan
400
+ dffix.set_index("index", inplace=True)
401
+ return dffix
402
+
403
+
404
+ def compute_run(dffix, algo_choice):
405
+ if "fixation_number" not in dffix.columns and "num" in dffix.columns:
406
+ dffix["fixation_number"] = dffix["num"]
407
+ tmp = dffix.copy()
408
+ tmp.reset_index(inplace=True, drop=True)
409
+ # initialize
410
+ tmp.loc[~tmp[f"on_word_{algo_choice}"].isna(), f"word_runid_{algo_choice}"] = 0
411
+ tmp[f"sentence_runid_{algo_choice}"] = 0
412
+
413
+ # fixation loop
414
+ if len(tmp) > 1:
415
+ for j in range(1, len(tmp)):
416
+
417
+ # word
418
+ if tmp[f"word_reg_in_{algo_choice}"][j] == 1 and tmp[f"word_reg_in_{algo_choice}"][j - 1] != 1:
419
+ tmp.loc[j, f"word_runid_{algo_choice}"] = tmp[f"word_runid_{algo_choice}"][j - 1] + 1
420
+ else:
421
+ tmp.loc[j, f"word_runid_{algo_choice}"] = tmp.loc[j - 1, f"word_runid_{algo_choice}"]
422
+
423
+ # sentence
424
+ if tmp[f"sentence_reg_in_{algo_choice}"][j] == 1 and tmp[f"sentence_reg_in_{algo_choice}"][j - 1] != 1:
425
+ tmp.loc[j, f"sentence_runid_{algo_choice}"] = tmp[f"sentence_runid_{algo_choice}"][j - 1] + 1
426
+ else:
427
+ tmp.loc[j, f"sentence_runid_{algo_choice}"] = tmp[f"sentence_runid_{algo_choice}"][j - 1]
428
+ tmp[f"word_runid_{algo_choice}"] = tmp[f"word_runid_{algo_choice}"] - 1
429
+ tmp[f"sentence_runid_{algo_choice}"] = tmp[f"sentence_runid_{algo_choice}"] - 1
430
+ # fixid in word
431
+ tmp[f"word_fix_{algo_choice}"] = tmp.groupby(f"on_word_number_{algo_choice}")["fixation_number"].transform(
432
+ lambda x: stats.rankdata(x, method="min")
433
+ )
434
+ # fixid in sent
435
+ tmp[f"sentence_fix_{algo_choice}"] = tmp.groupby(f"on_sentence_num_{algo_choice}")["fixation_number"].transform(
436
+ lambda x: stats.rankdata(x, method="min")
437
+ )
438
+
439
+ # runid in word
440
+ tmp["id"] = tmp[f"on_word_number_{algo_choice}"].astype(str) + ":" + tmp[f"word_runid_{algo_choice}"].astype(str)
441
+ fix_tmp = tmp.copy().drop_duplicates(subset="id")
442
+ fix_tmp[f"word_run_{algo_choice}"] = fix_tmp.groupby(f"on_word_number_{algo_choice}")[
443
+ f"word_runid_{algo_choice}"
444
+ ].transform(lambda x: stats.rankdata(x, method="min"))
445
+
446
+ if f"word_run_{algo_choice}" in tmp.columns:
447
+ tmp = tmp.drop(columns=[f"word_run_{algo_choice}"])
448
+ tmp = pd.merge(tmp, fix_tmp[["id", f"word_run_{algo_choice}"]], on="id")
449
+ del tmp["id"]
450
+ tmp = tmp.sort_values("fixation_number")
451
+
452
+ # runid in sentence
453
+ tmp["id"] = (
454
+ tmp[f"on_sentence_num_{algo_choice}"].astype(str) + ":" + tmp[f"sentence_runid_{algo_choice}"].astype(str)
455
+ )
456
+ fix_tmp = tmp.copy().drop_duplicates(subset="id")
457
+ fix_tmp[f"sentence_run_{algo_choice}"] = fix_tmp.groupby(f"on_sentence_num_{algo_choice}")["id"].transform(
458
+ lambda x: stats.rankdata(x, method="min")
459
+ )
460
+ if f"sentence_run_{algo_choice}" in tmp.columns:
461
+ tmp = tmp.drop(columns=[f"sentence_run_{algo_choice}"])
462
+ tmp = pd.merge(tmp, fix_tmp[["id", f"sentence_run_{algo_choice}"]], on="id")
463
+ del tmp["id"]
464
+ tmp = tmp.sort_values("fixation_number")
465
+
466
+ # fixnum in word_run
467
+ tmp["id"] = tmp[f"on_word_number_{algo_choice}"].astype(str) + ":" + tmp[f"word_run_{algo_choice}"].astype(str)
468
+ tmp[f"word_run_fix_{algo_choice}"] = tmp.groupby(["id"])["fixation_number"].rank("first").values
469
+ del tmp["id"]
470
+ tmp = tmp.sort_values("fixation_number")
471
+
472
+ # fixnum in sentence_run
473
+ tmp["id"] = tmp[f"on_sentence_num_{algo_choice}"].astype(str) + ":" + tmp[f"sentence_run_{algo_choice}"].astype(str)
474
+ tmp[f"sentence_run_fix_{algo_choice}"] = tmp.groupby(["id"])["fixation_number"].rank("first").values
475
+ del tmp["id"]
476
+ tmp = tmp.sort_values("fixation_number")
477
+ names = [
478
+ "fixation_number",
479
+ f"word_runid_{algo_choice}",
480
+ f"sentence_runid_{algo_choice}",
481
+ f"word_fix_{algo_choice}",
482
+ f"sentence_fix_{algo_choice}",
483
+ f"word_run_{algo_choice}",
484
+ f"sentence_run_{algo_choice}",
485
+ f"word_run_fix_{algo_choice}",
486
+ f"sentence_run_fix_{algo_choice}",
487
+ ]
488
+ dffix = pd.merge(dffix, tmp[names], on="fixation_number", how="left")
489
+ return dffix.copy()
490
+
491
+
492
+ def compute_landing_position(dffix, algo_choice):
493
+ dffix[f"word_cland_{algo_choice}"] = (
494
+ dffix[f"word_land_{algo_choice}"] - (dffix[f"on_word_{algo_choice}"].str.len() + 1) / 2
495
+ )
496
+ return dffix
497
+
498
+
499
+ def aggregate_words_firstrun(
500
+ fix,
501
+ algo_choice,
502
+ measures_to_calculate=[
503
+ "firstrun_blink",
504
+ "firstrun_skip",
505
+ "firstrun_refix",
506
+ "firstrun_reg_in",
507
+ "firstrun_reg_out",
508
+ "firstrun_dur",
509
+ "firstrun_gopast",
510
+ "firstrun_gopast_sel",
511
+ ],
512
+ ):
513
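+ # A word's "first run" is its first sequence of consecutive fixations
+ # (word_run == 1); all firstrun_* measures below are computed on that subset.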
+ firstruntmp = fix.loc[fix[f"word_run_{algo_choice}"] == 1].copy()
514
+
515
+ firstrun = firstruntmp.drop_duplicates(subset=f"on_word_number_{algo_choice}", keep="first").copy()
516
+
517
+ names = [
518
+ "subject",
519
+ "trial_id",
520
+ "item",
521
+ "condition",
522
+ f"on_word_number_{algo_choice}",
523
+ f"on_word_{algo_choice}",
524
+ "fixation_number",
525
+ ]
526
+ firstrun = firstrun[names].sort_values(f"on_word_number_{algo_choice}")
527
+
528
+ # compute measures
529
+ firstrun[f"firstrun_nfix_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
530
+ "fixation_number"
531
+ ].transform(
532
+ "count"
533
+ ) # Required for many other measures
534
+ firstrun[f"firstrun_nfix_{algo_choice}"] = firstrun[f"firstrun_nfix_{algo_choice}"].fillna(0)
535
+ if "firstrun_blink" in measures_to_calculate:
536
+ if "blink" in firstruntmp:
537
+ firstrun[f"firstrun_blink_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
538
+ "blink"
539
+ ].transform("max")
540
+ else:
541
+ firstrun[f"firstrun_blink_{algo_choice}"] = 0
542
+
543
+ if "firstrun_skip" in measures_to_calculate:
544
+ firstrun[f"firstrun_skip_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
545
+ f"word_firstskip_{algo_choice}"
546
+ ].transform("max")
547
+ if "firstrun_refix" in measures_to_calculate:
548
+ firstrun[f"firstrun_refix_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
549
+ f"word_refix_{algo_choice}"
550
+ ].transform("max")
551
+ if "firstrun_reg_in" in measures_to_calculate:
552
+ firstrun[f"firstrun_reg_in_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
553
+ f"word_reg_out_{algo_choice}"
554
+ ].transform("max")
555
+ if "firstrun_reg_out" in measures_to_calculate:
556
+ firstrun[f"firstrun_reg_out_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
557
+ f"word_reg_in_{algo_choice}"
558
+ ].transform("max")
559
+ if "firstrun_dur" in measures_to_calculate:
560
+ firstrun[f"firstrun_dur_{algo_choice}"] = firstruntmp.groupby(f"on_word_number_{algo_choice}")[
561
+ "duration"
562
+ ].transform("sum")
563
+ firstrun = firstrun.sort_values(["trial_id", f"on_word_number_{algo_choice}"]).copy()
564
+
565
+ return firstrun
566
+
567
+
568
+ def compute_gopast_word(fixations_dataframe, algo_choice):
569
+
570
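+ # For each word, gopast sums the durations of all fixations from the first
+ # fixation on the word until a later word in the text is fixated (regression
+ # path duration), while selgopast only counts the fixations that land on the
+ # word itself (selective go-past time).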
+ ias = np.unique(fixations_dataframe.loc[:, f"on_word_number_{algo_choice}"])
571
+
572
+ for j in range(len(ias) - 1):
573
+ fixations_dataframe.loc[
574
+ (fixations_dataframe[f"on_word_number_{algo_choice}"] == ias[j]), f"gopast_{algo_choice}"
575
+ ] = np.nansum(
576
+ fixations_dataframe.loc[
577
+ (
578
+ fixations_dataframe["fixation_number"]
579
+ >= np.min(
580
+ fixations_dataframe.loc[
581
+ (fixations_dataframe[f"on_word_number_{algo_choice}"] == ias[j]), "fixation_number"
582
+ ]
583
+ )
584
+ )
585
+ & (
586
+ fixations_dataframe["fixation_number"]
587
+ < np.min(
588
+ fixations_dataframe.loc[
589
+ (fixations_dataframe[f"on_word_number_{algo_choice}"] > ias[j]), "fixation_number"
590
+ ]
591
+ )
592
+ )
593
+ & (~fixations_dataframe[f"on_word_number_{algo_choice}"].isna())
594
+ ]["duration"]
595
+ )
596
+
597
+ fixations_dataframe.loc[
598
+ (fixations_dataframe[f"on_word_number_{algo_choice}"] == ias[j]), f"selgopast_{algo_choice}"
599
+ ] = np.nansum(
600
+ fixations_dataframe.loc[
601
+ (
602
+ fixations_dataframe["fixation_number"]
603
+ >= np.min(
604
+ fixations_dataframe.loc[
605
+ (fixations_dataframe[f"on_word_number_{algo_choice}"] == ias[j]), "fixation_number"
606
+ ]
607
+ )
608
+ )
609
+ & (
610
+ fixations_dataframe["fixation_number"]
611
+ < np.min(
612
+ fixations_dataframe.loc[
613
+ (fixations_dataframe[f"on_word_number_{algo_choice}"] > ias[j]), "fixation_number"
614
+ ]
615
+ )
616
+ )
617
+ & (fixations_dataframe[f"on_word_number_{algo_choice}"] == ias[j])
618
+ & (~fixations_dataframe[f"on_word_number_{algo_choice}"].isna())
619
+ ]["duration"]
620
+ )
621
+ return fixations_dataframe
622
+
623
+
624
+ def aggregate_words(
625
+ fix,
626
+ word_item,
627
+ algo_choice,
628
+ measures_to_calculate=[
629
+ "blink",
630
+ ],
631
+ ):
632
+ wordtmp = fix.copy()
633
+
634
+ word = wordtmp.drop_duplicates(subset=f"on_word_number_{algo_choice}", keep="first").copy()
635
+ names = [
636
+ f"on_sentence_num_{algo_choice}",
637
+ f"on_word_number_{algo_choice}",
638
+ f"on_word_{algo_choice}",
639
+ ]
640
+ word = word.loc[:, names].sort_values(by=f"on_word_number_{algo_choice}")
641
+
642
+ wordtmp = compute_gopast_word(wordtmp, algo_choice)
643
+
644
+ if "blink" in measures_to_calculate:
645
+ if "blink" in wordtmp:
646
+ word[f"blink_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")["blink"].transform("max")
647
+ else:
648
+ word[f"blink_{algo_choice}"] = 0
649
+ if "nrun" in measures_to_calculate or "reread" in measures_to_calculate:
650
+ word[f"nrun_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
651
+ f"word_run_{algo_choice}"
652
+ ].transform("max")
653
+ if "reread" in measures_to_calculate:
654
+ word[f"reread_{algo_choice}"] = word[f"nrun_{algo_choice}"] > 1
655
+ word[f"number_of_fixations_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
656
+ "fixation_number"
657
+ ].transform("count")
658
+ if "refix" in measures_to_calculate:
659
+ word[f"refix_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
660
+ f"word_refix_{algo_choice}"
661
+ ].transform("max")
662
+ if "reg_in" in measures_to_calculate:
663
+ word[f"reg_in_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
664
+ f"word_reg_in_{algo_choice}"
665
+ ].transform("max")
666
+ if "reg_out" in measures_to_calculate:
667
+ word[f"reg_out_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
668
+ f"word_reg_out_{algo_choice}"
669
+ ].transform("max")
670
+ if "total_fixation_duration" in measures_to_calculate:
671
+ word[f"total_fixation_duration_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
672
+ "duration"
673
+ ].transform("sum")
674
+ if "gopast" in measures_to_calculate and f"gopast_{algo_choice}" in wordtmp.columns:
675
+ word[f"gopast_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
676
+ f"gopast_{algo_choice}"
677
+ ].transform("max")
678
+ word[f"gopast_{algo_choice}"] = word[f"gopast_{algo_choice}"].fillna(0)
679
+
680
+ if "gopast_sel" in measures_to_calculate and f"selgopast_{algo_choice}" in wordtmp.columns:
681
+ word[f"gopast_sel_{algo_choice}"] = wordtmp.groupby(f"on_word_number_{algo_choice}")[
682
+ f"selgopast_{algo_choice}"
683
+ ].transform("max")
684
+ word[f"gopast_sel_{algo_choice}"] = word[f"gopast_sel_{algo_choice}"].fillna(0)
685
+
686
+ word.rename({f"on_word_number_{algo_choice}": "word_number"}, axis=1, inplace=True)
687
+ word = pd.merge(
688
+ word.reset_index(drop=True), word_item.reset_index(drop=True), on="word_number", how="right", validate="1:1"
689
+ )
690
+ word[f"number_of_fixations_{algo_choice}"] = word[f"number_of_fixations_{algo_choice}"].fillna(0)
691
+ if "total_fixation_duration" in measures_to_calculate:
692
+ word[f"total_fixation_duration_{algo_choice}"] = word[f"total_fixation_duration_{algo_choice}"].fillna(0)
693
+
694
+ word[f"skip_{algo_choice}"] = 0
695
+ if "blink" in measures_to_calculate:
696
+ word.loc[word[f"blink_{algo_choice}"].isna(), f"skip_{algo_choice}"] = 1
697
+ word.loc[word[f"number_of_fixations_{algo_choice}"] == 0, f"skip_{algo_choice}"] = 1
698
+ word[f"skip_{algo_choice}"] = word[f"skip_{algo_choice}"].astype("boolean")
699
+
700
+ if "number_of_fixations" not in measures_to_calculate:
701
+ word = word.drop(columns=f"number_of_fixations_{algo_choice}")
702
+ if "blink" in measures_to_calculate:
703
+ word[f"blink_{algo_choice}"] = word[f"blink_{algo_choice}"].astype("boolean")
704
+
705
+ word = word.sort_values(by=["word_number"])
706
+
707
+ if "condition" in wordtmp.columns and "condition" not in word.columns:
708
+ word.insert(loc=0, column="condition", value=wordtmp["condition"].iloc[0])
709
+ if "item" in wordtmp.columns and "item" not in word.columns:
710
+ word.insert(loc=0, column="item", value=wordtmp["item"].iloc[0])
711
+ if "trial_id" in wordtmp.columns and "trial_id" not in word.columns:
712
+ word.insert(loc=0, column="trial_id", value=wordtmp["trial_id"].iloc[0])
713
+ if "subject" in wordtmp.columns and "subject" not in word.columns:
714
+ word.insert(loc=0, column="subject", value=wordtmp["subject"].iloc[0])
715
+
716
+ return word
717
+
718
+
719
+ def combine_words(fix, wordfirst, wordtmp, algo_choice, measures_to_calculate):
720
+
721
+ subject = wordtmp["subject"].values[0]
722
+ trial_id = wordtmp["trial_id"].values[0]
723
+ item = wordtmp["item"].values[0]
724
+ condition = wordtmp["condition"].values[0]
725
+ wordtmp = wordtmp.loc[
726
+ :,
727
+ [
728
+ c
729
+ for c in [
730
+ "word_number",
731
+ "word",
732
+ f"blink_{algo_choice}",
733
+ f"skip_{algo_choice}",
734
+ f"nrun_{algo_choice}",
735
+ f"reread_{algo_choice}",
736
+ f"number_of_fixations_{algo_choice}",
737
+ f"refix_{algo_choice}",
738
+ f"reg_in_{algo_choice}",
739
+ f"reg_out_{algo_choice}",
740
+ f"total_fixation_duration_{algo_choice}",
741
+ f"gopast_{algo_choice}",
742
+ f"gopast_sel_{algo_choice}",
743
+ ]
744
+ if c in wordtmp.columns
745
+ ],
746
+ ]
747
+
748
+ wordfirsttmp = wordfirst.loc[
749
+ :,
750
+ [
751
+ c
752
+ for c in [
753
+ f"on_word_number_{algo_choice}",
754
+ f"firstrun_skip_{algo_choice}",
755
+ f"firstrun_nfix_{algo_choice}",
756
+ f"firstrun_refix_{algo_choice}",
757
+ f"firstrun_reg_in_{algo_choice}",
758
+ f"firstrun_reg_out_{algo_choice}",
759
+ f"firstrun_dur_{algo_choice}",
760
+ f"firstrun_gopast_{algo_choice}",
761
+ f"firstrun_gopast_sel_{algo_choice}",
762
+ ]
763
+ if c in wordfirst.columns
764
+ ],
765
+ ]
766
+
767
+ fixtmp = fix[(fix[f"word_run_{algo_choice}"] == 1) & (fix[f"word_run_fix_{algo_choice}"] == 1)].copy()
768
+ names = [
769
+ c
770
+ for c in [
771
+ f"on_word_number_{algo_choice}",
772
+ f"sac_in_{algo_choice}",
773
+ f"sac_out_{algo_choice}",
774
+ f"word_launch_{algo_choice}",
775
+ f"word_land_{algo_choice}",
776
+ f"word_cland_{algo_choice}",
777
+ f"duration",
778
+ ]
779
+ if c in fixtmp.columns
780
+ ]
781
+ fixtmp = fixtmp[names].copy()
782
+ fixtmp.rename(
783
+ {
784
+ f"sac_in_{algo_choice}": f"firstfix_sac_in_{algo_choice}",
785
+ f"sac_out_{algo_choice}": f"firstfix_sac_out_{algo_choice}",
786
+ f"word_launch_{algo_choice}": f"firstfix_launch_{algo_choice}",
787
+ f"word_land_{algo_choice}": f"firstfix_land_{algo_choice}",
788
+ f"word_cland_{algo_choice}": f"firstfix_cland_{algo_choice}",
789
+ f"duration": f"firstfix_dur_{algo_choice}",
790
+ },
791
+ axis=1,
792
+ inplace=True,
793
+ )
794
+ comb = pd.merge(
795
+ pd.merge(
796
+ wordtmp,
797
+ wordfirsttmp.rename({f"on_word_number_{algo_choice}": "word_number"}, axis=1),
798
+ on="word_number",
799
+ how="left",
800
+ ),
801
+ fixtmp.rename({f"on_word_number_{algo_choice}": "word_number"}, axis=1),
802
+ on="word_number",
803
+ how="left",
804
+ )
805
+
806
+ dropcols = [
807
+ c
808
+ for c in [
809
+ f"firstrun_skip_{algo_choice}",
810
+ f"firstrun_refix_{algo_choice}",
811
+ f"firstrun_reg_in_{algo_choice}",
812
+ f"firstrun_reg_out_{algo_choice}",
813
+ f"firstrun_dur_{algo_choice}",
814
+ f"firstrun_gopast_{algo_choice}",
815
+ f"firstrun_gopast_sel_{algo_choice}",
816
+ f"firstfix_sac_in_{algo_choice}",
817
+ f"firstfix_sac_out_{algo_choice}",
818
+ f"firstfix_launch_{algo_choice}",
819
+ f"firstfix_land_{algo_choice}",
820
+ f"firstfix_cland_{algo_choice}",
821
+ f"firstfix_dur_{algo_choice}",
822
+ ]
823
+ if ((c.replace(f"_{algo_choice}", "") not in measures_to_calculate) & (c in comb.columns))
824
+ ]
825
+ comb = comb.drop(columns=dropcols).copy()
826
+ comb.sort_values(by="word_number", inplace=True)
827
+
828
+ # recompute firstrun skip (word skips also count as firstrun skips)
829
+ if f"skip_{algo_choice}" in comb.columns and f"firstrun_skip_{algo_choice}" in comb.columns:
830
+ comb.loc[comb[f"skip_{algo_choice}"] == 1, f"firstrun_skip_{algo_choice}"] = 1
831
+
832
+ # gopast time in firstrun
833
+ if f"gopast_{algo_choice}" in comb.columns and "firstrun_gopast" in measures_to_calculate:
834
+ comb[f"firstrun_gopast_{algo_choice}"] = comb[f"gopast_{algo_choice}"]
835
+ if f"gopast_sel_{algo_choice}" in comb.columns and "firstrun_gopast_sel" in measures_to_calculate:
836
+ comb[f"firstrun_gopast_sel_{algo_choice}"] = comb[f"gopast_sel_{algo_choice}"]
837
+ if f"gopast_{algo_choice}" in comb.columns:
838
+ comb.drop(columns=[f"gopast_{algo_choice}"], inplace=True)
839
+
840
+ if f"gopast_sel_{algo_choice}" in comb.columns:
841
+ comb.drop(columns=[f"gopast_sel_{algo_choice}"], inplace=True)
842
+
843
+ if f"firstrun_nfix_{algo_choice}" in comb.columns and "singlefix" in measures_to_calculate:
844
+ comb[f"singlefix_{algo_choice}"] = 0
845
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_{algo_choice}"] = 1
846
+
847
+ if f"firstfix_sac_in_{algo_choice}" in comb.columns and "singlefix_sac_in" in measures_to_calculate:
848
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_sac_in_{algo_choice}"] = comb[
849
+ f"firstfix_sac_in_{algo_choice}"
850
+ ][(comb[f"firstrun_nfix_{algo_choice}"] == 1)]
851
+
852
+ if f"firstfix_sac_out_{algo_choice}" in comb.columns and "singlefix_sac_out" in measures_to_calculate:
853
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_sac_out_{algo_choice}"] = comb[
854
+ f"firstfix_sac_out_{algo_choice}"
855
+ ][(comb[f"firstrun_nfix_{algo_choice}"] == 1)]
856
+
857
+ if f"firstfix_launch_{algo_choice}" in comb.columns and "singlefix_launch" in measures_to_calculate:
858
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_launch_{algo_choice}"] = comb[
859
+ f"firstfix_launch_{algo_choice}"
860
+ ][(comb[f"firstrun_nfix_{algo_choice}"] == 1)]
861
+
862
+ if f"firstfix_land_{algo_choice}" in comb.columns and "singlefix_land" in measures_to_calculate:
863
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_land_{algo_choice}"] = comb[
864
+ f"firstfix_land_{algo_choice}"
865
+ ][(comb[f"firstrun_nfix_{algo_choice}"] == 1)]
866
+
867
+ if f"firstfix_cland_{algo_choice}" in comb.columns and "singlefix_cland" in measures_to_calculate:
868
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_cland_{algo_choice}"] = comb[
869
+ f"firstfix_cland_{algo_choice}"
870
+ ][(comb[f"firstrun_nfix_{algo_choice}"] == 1)]
871
+
872
+ if f"firstfix_dur_{algo_choice}" in comb.columns and "singlefix_dur" in measures_to_calculate:
873
+ comb.loc[(comb[f"firstrun_nfix_{algo_choice}"] == 1), f"singlefix_dur_{algo_choice}"] = comb[
874
+ f"firstfix_dur_{algo_choice}"
875
+ ][(comb[f"firstrun_nfix_{algo_choice}"] == 1)]
876
+
877
+ if "condition" not in comb.columns:
878
+ comb.insert(loc=0, column="condition", value=condition)
879
+ if "item" not in comb.columns:
880
+ comb.insert(loc=0, column="item", value=item)
881
+ if "trial_id" not in comb.columns:
882
+ comb.insert(loc=0, column="trial_id", value=trial_id)
883
+ if "subject" not in comb.columns:
884
+ comb.insert(loc=0, column="subject", value=subject)
885
+ return comb.copy()
886
+
887
+
888
+ def compute_sentence_measures(fix, stimmat, algo_choice, measures_to_calc, save_to_csv=False):
889
+ sentitem = stimmat.drop_duplicates(
890
+ subset="in_sentence_number", keep="first"
891
+ ) # TODO check why there are rows with sent number None
892
+ fixin = fix.copy().reset_index(drop=True)
893
+
894
+ fixin["on_sentence_num2"] = fixin[f"on_sentence_num_{algo_choice}"].copy()
895
+
896
+ # Recompute sentence number (two fixation exception rule)
897
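+ # If a fixation falls on a different sentence than its neighbours but one of the
+ # next two fixations returns to the preceding sentence, the stray fixation is
+ # re-assigned to that preceding sentence.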
+ for j in range(1, len(fixin) - 1):
898
+ if fixin.loc[j, "on_sentence_num2"] != fixin.loc[j - 1, "on_sentence_num2"]:
899
+ if j + 1 in fixin.index and fixin.loc[j + 1, "on_sentence_num2"] == fixin.loc[j - 1, "on_sentence_num2"]:
900
+ fixin.loc[j, "on_sentence_num2"] = fixin.loc[j - 1, "on_sentence_num2"]
901
+ elif j + 2 in fixin.index and fixin.loc[j + 2, "on_sentence_num2"] == fixin.loc[j - 1, "on_sentence_num2"]:
902
+ fixin.loc[j, "on_sentence_num2"] = fixin.loc[j - 1, "on_sentence_num2"]
903
+
904
+ fixin["id"] = fixin.apply(lambda row: f"{row['on_sentence_num2']}", axis=1)
905
+
906
+ fixin[f"sent_reg_in2_{algo_choice}"] = 0
907
+ fixin[f"sent_reg_out2_{algo_choice}"] = 0
908
+
909
+ fixin[f"sent_runid2_{algo_choice}"] = 1
910
+
911
+ fixin.loc[0, "last"] = fixin.loc[0, "id"]
912
+ fixin.loc[0, f"firstpass_{algo_choice}"] = 1
913
+ mem = [fixin.loc[0, "on_sentence_num2"]]
914
+ wordmem = [fixin.loc[0, f"on_word_number_{algo_choice}"]]
915
+ fixin.loc[0, f"forward_{algo_choice}"] = 1
916
+
917
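+ # Single pass over fixations: record the previous sentence id, flag transitions
+ # between sentences (reg_in on the fixation entering a sentence, reg_out on the
+ # one just left), extend the sentence run id, and mark first-pass fixations as
+ # well as forward fixations (fixations landing on a word further to the right
+ # than any word fixated so far).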
+ for j in range(1, len(fixin)):
918
+ fixin.loc[j, "last"] = fixin.loc[j - 1, "id"]
919
+
920
+ if fixin.loc[j, "on_sentence_num2"] != fixin.loc[j - 1, "on_sentence_num2"]:
921
+ fixin.loc[j, f"sent_reg_in2_{algo_choice}"] = 1
922
+ fixin.loc[j - 1, f"sent_reg_out2_{algo_choice}"] = 1
923
+ fixin.loc[j, f"sent_reg_in_from2_{algo_choice}"] = fixin.loc[j - 1, "on_sentence_num2"]
924
+ fixin.loc[j - 1, f"sent_reg_out_to2_{algo_choice}"] = fixin.loc[j, "on_sentence_num2"]
925
+
926
+ if fixin.loc[j, f"sent_reg_in2_{algo_choice}"] == 1 and fixin.loc[j - 1, f"sent_reg_in2_{algo_choice}"] != 1:
927
+ fixin.loc[j, f"sent_runid2_{algo_choice}"] = fixin.loc[j - 1, f"sent_runid2_{algo_choice}"] + 1
928
+ else:
929
+ fixin.loc[j, f"sent_runid2_{algo_choice}"] = fixin.loc[j - 1, f"sent_runid2_{algo_choice}"]
930
+
931
+ if fixin.loc[j, "on_sentence_num2"] >= fixin.loc[j - 1, "on_sentence_num2"]:
932
+ if fixin.loc[j, "on_sentence_num2"] in mem:
933
+ if fixin.loc[j, "on_sentence_num2"] == max(mem):
934
+ fixin.loc[j, f"firstpass_{algo_choice}"] = 1
935
+ else:
936
+ fixin.loc[j, f"firstpass_{algo_choice}"] = 0
937
+ else:
938
+ mem.append(fixin.loc[j, "on_sentence_num2"])
939
+ fixin.loc[j, f"firstpass_{algo_choice}"] = 1
940
+ else:
941
+ fixin.loc[j, f"firstpass_{algo_choice}"] = 0
942
+
943
+ if fixin.loc[j, f"on_word_number_{algo_choice}"] > max(wordmem):
944
+ wordmem.append(fixin.loc[j, f"on_word_number_{algo_choice}"])
945
+ fixin.loc[j, f"forward_{algo_choice}"] = 1
946
+ elif fixin.loc[j, f"on_word_number_{algo_choice}"] < max(wordmem):
947
+ fixin.loc[j, f"forward_{algo_choice}"] = 0
948
+
949
+ for i in range(len(fixin) - 3):
950
+ if fixin.loc[i, f"line_change_{algo_choice}"] > 0:
951
+ fixin.loc[i, "on_word_number"] = 0
952
+ fixin.loc[i + 1, f"forward_{algo_choice}"] = 1
953
+ fixin.loc[i + 2, f"forward_{algo_choice}"] = 1
954
+ fixin.loc[i + 3, f"forward_{algo_choice}"] = 1
955
+
956
+ for i in range(1, len(fixin) - 3):
957
+ if fixin.loc[i, "on_sentence_num2"] > fixin.loc[i - 1, "on_sentence_num2"]:
958
+ fixin.loc[i + 1, f"forward_{algo_choice}"] = 1
959
+ fixin.loc[i + 2, f"forward_{algo_choice}"] = 1
960
+
961
+ fixin["id2"] = fixin["id"] + ":" + fixin[f"sent_runid2_{algo_choice}"].astype(str)
962
+
963
+ fixin = fixin.sort_values(["trial_id", "fixation_number"])
964
+
965
+ sent = fixin.copy().drop_duplicates(subset="id", keep="first")
966
+ names = [
967
+ "id",
968
+ "subject",
969
+ "trial_id",
970
+ "item",
971
+ "condition",
972
+ "on_sentence_num2",
973
+ f"on_sentence_num_{algo_choice}",
974
+ f"on_sentence_{algo_choice}",
975
+ "num_words_in_sentence",
976
+ ]
977
+ sent = sent[names].reset_index(drop=True)
978
+
979
+ sent[f"firstrun_skip_{algo_choice}"] = 0
980
+
981
+ mem = []
982
+ for j in range(len(sent)):
983
+ if not pd.isna(sent.loc[j, f"on_sentence_num_{algo_choice}"]):
984
+ if len(mem) > 0 and sent.loc[j, f"on_sentence_num_{algo_choice}"] < max(mem) and not pd.isna(max(mem)):
985
+ sent.loc[j, f"firstrun_skip_{algo_choice}"] = 1
986
+ if (
987
+ not pd.isna(sent.loc[j, f"on_sentence_num_{algo_choice}"])
988
+ and sent.loc[j, f"on_sentence_num_{algo_choice}"] not in mem
989
+ ):
990
+ mem.append(sent.loc[j, f"on_sentence_num_{algo_choice}"])
991
+
992
+ if "total_n_fixations" in measures_to_calc:
993
+ tmp = fixin.groupby("id")["duration"].count().reset_index()
994
+ tmp.columns = ["id", f"total_n_fixations_{algo_choice}"]
995
+ sent = pd.merge(sent, tmp, on="id", how="left")
996
+ sent.fillna({f"total_n_fixations_{algo_choice}": 0}, inplace=True)
997
+
998
+ tmp = fixin.groupby("id")["duration"].sum().reset_index()
999
+ tmp.columns = ["id", f"total_dur_{algo_choice}"]
1000
+ sent = pd.merge(sent, tmp, on="id", how="left")
1001
+ sent.fillna({f"total_dur_{algo_choice}": 0}, inplace=True)
1002
+
1003
+ if "firstpass_n_fixations" in measures_to_calc:
1004
+ tmp = fixin[fixin[f"firstpass_{algo_choice}"] == 1].groupby("id")["duration"].count().reset_index()
1005
+ tmp.columns = ["id", f"firstpass_n_fixations_{algo_choice}"]
1006
+ sent = pd.merge(sent, tmp, on="id", how="left")
1007
+ sent.fillna({f"firstpass_n_fixations_{algo_choice}": 0}, inplace=True)
1008
+
1009
+ if "firstpass_dur" in measures_to_calc:
1010
+ tmp = fixin[fixin[f"firstpass_{algo_choice}"] == 1].groupby("id")["duration"].sum().reset_index()
1011
+ tmp.columns = ["id", f"firstpass_dur_{algo_choice}"]
1012
+ sent = pd.merge(sent, tmp, on="id", how="left")
1013
+ sent.fillna({f"firstpass_dur_{algo_choice}": 0}, inplace=True)
1014
+
1015
+ if "firstpass_forward_n_fixations" in measures_to_calc:
1016
+ tmp = (
1017
+ fixin[(fixin[f"firstpass_{algo_choice}"] == 1) & (fixin[f"forward_{algo_choice}"] == 1)]
1018
+ .groupby("id")["duration"]
1019
+ .count()
1020
+ .reset_index()
1021
+ )
1022
+ tmp.columns = ["id", f"firstpass_forward_n_fixations_{algo_choice}"]
1023
+ sent = pd.merge(sent, tmp, on="id", how="left")
1024
+ sent.fillna({f"firstpass_forward_n_fixations_{algo_choice}": 0}, inplace=True)
1025
+
1026
+ if "firstpass_forward_dur" in measures_to_calc:
1027
+ tmp = (
1028
+ fixin[(fixin[f"firstpass_{algo_choice}"] == 1) & (fixin[f"forward_{algo_choice}"] == 1)]
1029
+ .groupby("id")["duration"]
1030
+ .sum()
1031
+ .reset_index()
1032
+ )
1033
+ tmp.columns = ["id", f"firstpass_forward_dur_{algo_choice}"]
1034
+ sent = pd.merge(sent, tmp, on="id", how="left")
1035
+ sent.fillna({f"firstpass_forward_dur_{algo_choice}": 0}, inplace=True)
1036
+
1037
+ if "firstpass_reread_n_fixations" in measures_to_calc:
1038
+ tmp = (
1039
+ fixin[(fixin[f"firstpass_{algo_choice}"] == 1) & (fixin[f"forward_{algo_choice}"] == 0)]
1040
+ .groupby("id")["duration"]
1041
+ .count()
1042
+ .reset_index()
1043
+ )
1044
+ tmp.columns = ["id", f"firstpass_reread_n_fixations_{algo_choice}"]
1045
+ sent = pd.merge(sent, tmp, on="id", how="left")
1046
+ sent.fillna({f"firstpass_reread_n_fixations_{algo_choice}": 0}, inplace=True)
1047
+
1048
+ if "firstpass_reread_dur" in measures_to_calc:
1049
+ tmp = (
1050
+ fixin[(fixin[f"firstpass_{algo_choice}"] == 1) & (fixin[f"forward_{algo_choice}"] == 0)]
1051
+ .groupby("id")["duration"]
1052
+ .sum()
1053
+ .reset_index()
1054
+ )
1055
+ tmp.columns = ["id", f"firstpass_reread_dur_{algo_choice}"]
1056
+ sent = pd.merge(sent, tmp, on="id", how="left")
1057
+ sent.fillna({f"firstpass_reread_dur_{algo_choice}": 0}, inplace=True)
1058
+
1059
+ if sum(fixin[f"firstpass_{algo_choice}"] == 0) != 0:
1060
+ if "lookback_n_fixations" in measures_to_calc:
1061
+ tmp = fixin[fixin[f"firstpass_{algo_choice}"] == 0].groupby("id")["duration"].count().reset_index()
1062
+ tmp.columns = ["id", f"lookback_n_fixations_{algo_choice}"]
1063
+ sent = pd.merge(sent, tmp, on="id", how="left")
1064
+ sent.fillna({f"lookback_n_fixations_{algo_choice}": 0}, inplace=True)
1065
+
1066
+ if "lookback_dur" in measures_to_calc:
1067
+ tmp = fixin[fixin[f"firstpass_{algo_choice}"] == 0].groupby("id")["duration"].sum().reset_index()
1068
+ tmp.columns = ["id", f"lookback_dur_{algo_choice}"]
1069
+ sent = pd.merge(sent, tmp, on="id", how="left")
1070
+ sent.fillna({f"lookback_dur_{algo_choice}": 0}, inplace=True)
1071
+
1072
+ fixin["id2"] = fixin.apply(lambda row: f"{row['id']}:{row[f'sent_runid2_{algo_choice}']}", axis=1)
1073
+ sent2 = fixin.drop_duplicates(subset="id2", keep="first")
1074
+ sent3 = sent2[(sent2[f"firstpass_{algo_choice}"] == 0) & (~pd.isna(sent2[f"sent_reg_in_from2_{algo_choice}"]))]
1075
+
1076
+ tmp = fixin[fixin["id2"].isin(sent3["id2"])].groupby("id")["duration"].count().reset_index()
1077
+ tmp.columns = ["id", f"lookfrom_n_fixations_{algo_choice}"]
1078
+ tmp2 = pd.merge(tmp, sent3)
1079
+ tmp3 = tmp2.groupby("last")[f"lookfrom_n_fixations_{algo_choice}"].sum().reset_index()
1080
+ tmp3.columns = ["last", f"lookfrom_n_fixations_{algo_choice}"]
1081
+ sent = pd.merge(sent, tmp3, left_on="id", right_on="last", how="left")
1082
+ sent.fillna({f"lookfrom_n_fixations_{algo_choice}": 0}, inplace=True)
1083
+
1084
+ if "lookfrom_dur" in measures_to_calc:
1085
+ tmp = fixin[fixin["id2"].isin(sent3["id2"])].groupby("id")["duration"].sum().reset_index()
1086
+ tmp.columns = ["id", f"lookfrom_dur_{algo_choice}"]
1087
+ tmp2 = pd.merge(tmp, sent3)
1088
+ tmp3 = tmp2.groupby("last")[f"lookfrom_dur_{algo_choice}"].sum().reset_index()
1089
+ tmp3.columns = ["last", f"lookfrom_dur_{algo_choice}"]
1090
+ sent = pd.merge(sent, tmp3, left_on="id", right_on="last", how="left")
1091
+ sent.fillna({f"lookfrom_dur_{algo_choice}": 0}, inplace=True)
1092
+
1093
+ # Firstrun
1094
+ firstruntmp = fixin[fixin[f"sentence_run_{algo_choice}"] == 1]
1095
+
1096
+ if "firstrun_reg_in" in measures_to_calc:
1097
+ tmp = firstruntmp.groupby("id")[f"sent_reg_in2_{algo_choice}"].max().reset_index()
1098
+ tmp.columns = ["id", f"firstrun_reg_in_{algo_choice}"]
1099
+ sent = pd.merge(sent, tmp, on="id", how="left")
1100
+ sent.fillna({f"firstrun_reg_in_{algo_choice}": 0}, inplace=True)
1101
+
1102
+ if "firstrun_reg_out" in measures_to_calc:
1103
+ tmp = firstruntmp.groupby("id")[f"sent_reg_out2_{algo_choice}"].max().reset_index()
1104
+ tmp.columns = ["id", f"firstrun_reg_out_{algo_choice}"]
1105
+ sent = pd.merge(sent, tmp, on="id", how="left")
1106
+ sent.fillna({f"firstrun_reg_out_{algo_choice}": 0}, inplace=True)
1107
+
1108
+ # Complete sentence
1109
+ gopasttmp = fixin.copy()
1110
+ gopasttmp[f"on_sentence_num_{algo_choice}"] = gopasttmp["on_sentence_num2"]
1111
+ tmp = compute_gopast_sentence(gopasttmp, algo_choice)
1112
+ names = ["id", f"gopast_{algo_choice}", f"selgopast_{algo_choice}"]
1113
+ tmp = tmp[names]
1114
+ tmp = tmp.drop_duplicates(subset="id", keep="first")
1115
+ tmp.columns = ["id", f"gopast_{algo_choice}", f"gopast_sel_{algo_choice}"]
1116
+ sent = pd.merge(sent, tmp, on="id", how="left")
1117
+
1118
+ # Nrun
1119
+ tmp = fixin.groupby("id")[f"sentence_run_{algo_choice}"].max().reset_index()
1120
+ tmp.columns = ["id", f"nrun_{algo_choice}"]
1121
+ sent = pd.merge(sent, tmp, on="id", how="left")
1122
+
1123
+ # Reread
1124
+ sent[f"reread_{algo_choice}"] = sent.apply(lambda row: 1 if row[f"nrun_{algo_choice}"] > 1 else 0, axis=1)
1125
+
1126
+ # Reg_in
1127
+ tmp = fixin.groupby("id")[f"sent_reg_in2_{algo_choice}"].max().reset_index()
1128
+ tmp.columns = ["id", f"reg_in_{algo_choice}"]
1129
+ sent = pd.merge(sent, tmp, on="id", how="left")
1130
+
1131
+ # Reg_out
1132
+ tmp = fixin.groupby("id")[f"sent_reg_out2_{algo_choice}"].max().reset_index()
1133
+ tmp.columns = ["id", f"reg_out_{algo_choice}"]
1134
+ sent = pd.merge(sent, tmp, on="id", how="left")
1135
+
1136
+ sent = sent.sort_values(by=f"on_sentence_num_{algo_choice}").reset_index(drop=True)
1137
+
1138
+ # Rate: reading rate in words per minute (durations are in ms)
1139
+ sent[f"rate_{algo_choice}"] = round(60000 / (sent[f"total_dur_{algo_choice}"] / sent["num_words_in_sentence"]))
1140
+
1141
+ # Write out
1142
+ item = sentitem.copy()
1143
+
1144
+ sent = pd.merge(
1145
+ sent,
1146
+ item.rename({"in_sentence_number": f"on_sentence_num_{algo_choice}"}, axis=1),
1147
+ on=f"on_sentence_num_{algo_choice}",
1148
+ how="left",
1149
+ )
1150
+ sent[f"skip_{algo_choice}"] = 0
1151
+ sent.loc[pd.isna(sent[f"nrun_{algo_choice}"]), f"skip_{algo_choice}"] = 1
1152
+
1153
+ names = [
1154
+ "subject",
1155
+ "trial_id",
1156
+ "item",
1157
+ "condition",
1158
+ ] + [
1159
+ c
1160
+ for c in [
1161
+ f"on_sentence_num_{algo_choice}",
1162
+ f"on_sentence_{algo_choice}",
1163
+ "num_words_in_sentence",
1164
+ f"skip_{algo_choice}",
1165
+ f"nrun_{algo_choice}",
1166
+ f"reread_{algo_choice}",
1167
+ f"reg_in_{algo_choice}",
1168
+ f"reg_out_{algo_choice}",
1169
+ f"total_n_fixations_{algo_choice}",
1170
+ f"total_dur_{algo_choice}",
1171
+ f"rate_{algo_choice}",
1172
+ f"gopast_{algo_choice}",
1173
+ f"gopast_sel_{algo_choice}",
1174
+ f"firstrun_skip_{algo_choice}",
1175
+ f"firstrun_reg_in_{algo_choice}",
1176
+ f"firstrun_reg_out_{algo_choice}",
1177
+ f"firstpass_n_fixations_{algo_choice}",
1178
+ f"firstpass_dur_{algo_choice}",
1179
+ f"firstpass_forward_n_fixations_{algo_choice}",
1180
+ f"firstpass_forward_dur_{algo_choice}",
1181
+ f"firstpass_reread_n_fixations_{algo_choice}",
1182
+ f"firstpass_reread_dur_{algo_choice}",
1183
+ f"lookback_n_fixations_{algo_choice}",
1184
+ f"lookback_dur_{algo_choice}",
1185
+ f"lookfrom_n_fixations_{algo_choice}",
1186
+ f"lookfrom_dur_{algo_choice}",
1187
+ ]
1188
+ if (c in sent.columns and c.replace(f"_{algo_choice}", "") in measures_to_calc)
1189
+ ]
1190
+ sent = sent[names].copy()
1191
+ sent.rename(
1192
+ {
1193
+ f"on_sentence_num_{algo_choice}": "sentence_number",
1194
+ f"on_sentence_{algo_choice}": "sentence",
1195
+ "num_words_in_sentence": "number_of_words",
1196
+ },
1197
+ axis=1,
1198
+ inplace=True,
1199
+ )
1200
+
1201
+ if save_to_csv:
1202
+ subj = fix["subject"].iloc[0]
1203
+ trial_id = fix["trial_id"].iloc[0]
1204
+ sent.to_csv(RESULTS_FOLDER / f"{subj}_{trial_id}_{algo_choice}_sentence_measures.csv")
1205
+ return sent.copy()
1206
+
1207
+
1208
+ def compute_gopast_sentence(fixin, algo_choice):
1209
+ # create response vectors
1210
+ fixin[f"gopast_{algo_choice}"] = np.nan
1211
+ fixin[f"selgopast_{algo_choice}"] = np.nan
1212
+
1213
+ # unique sentence numbers fixated in this trial
1214
+ ias = fixin[f"on_sentence_num_{algo_choice}"].unique()
1215
+
1216
+ # compute measures
1217
+ for j in ias:
1218
+ min_fixation_number_j = fixin.loc[fixin[f"on_sentence_num_{algo_choice}"] == j, "fixation_number"].min(
1219
+ skipna=True
1220
+ )
1221
+ next_min_fixation_number = (
1222
+ fixin.loc[fixin[f"on_sentence_num_{algo_choice}"] > j, "fixation_number"].min(skipna=True)
1223
+ if j != ias[-1]
1224
+ else float("inf")
1225
+ )
1226
+
1227
+ mask = (
1228
+ (fixin["fixation_number"] >= min_fixation_number_j)
1229
+ & (fixin["fixation_number"] < next_min_fixation_number)
1230
+ & (~fixin[f"on_sentence_num_{algo_choice}"].isna())
1231
+ )
1232
+ fixin.loc[fixin[f"on_sentence_num_{algo_choice}"] == j, f"gopast_{algo_choice}"] = fixin.loc[
1233
+ mask, "duration"
1234
+ ].sum(skipna=True)
1235
+
1236
+ mask_j = (
1237
+ (fixin["fixation_number"] >= min_fixation_number_j)
1238
+ & (fixin["fixation_number"] < next_min_fixation_number)
1239
+ & (~fixin[f"on_sentence_num_{algo_choice}"].isna())
1240
+ & (fixin[f"on_sentence_num_{algo_choice}"] == j)
1241
+ )
1242
+ fixin.loc[fixin[f"on_sentence_num_{algo_choice}"] == j, f"selgopast_{algo_choice}"] = fixin.loc[
1243
+ mask_j, "duration"
1244
+ ].sum(skipna=True)
1245
+
1246
+ return fixin
1247
+
1248
+
1249
+ def aggregate_trials(dffix_combined, wordcomb, all_trials_by_subj, algo_choices):
1250
+ tmp = dffix_combined.copy()
1251
+
1252
+ trial = tmp.drop_duplicates(subset="subject_trialID", keep="first")
1253
+ names = ["subject_trialID", "subject", "trial_id", "item", "condition"]
1254
+ trial = trial[names].copy()
1255
+
1256
+ for index, row in trial.iterrows():
1257
+ selected_trial = all_trials_by_subj[row["subject"]][row["trial_id"]]
1258
+ info_keys = [
1259
+ k for k in selected_trial.keys() if k in ["trial_start_time", "trial_end_time", "question_correct"]
1260
+ ]
1261
+ if row["subject"] in all_trials_by_subj and row["trial_id"] in all_trials_by_subj[row["subject"]]:
1262
+ if selected_trial["Fixation Cleaning Stats"]["Discard fixation before or after blinks"]:
1263
+ trial.at[index, "blink"] = selected_trial["Fixation Cleaning Stats"][
1264
+ "Number of discarded fixations due to blinks"
1265
+ ]
1266
+ for key, value in selected_trial.items():
1267
+ if key in info_keys:
1268
+ trial.at[index, key] = value
1269
+
1270
+ subdf = wordcomb.copy().loc[:, ["subject_trialID"]].drop_duplicates(subset=["subject_trialID"], keep="first")
1271
+ trial = pd.merge(trial, subdf, on="subject_trialID", how="left")
1272
+ for sub, subdf in wordcomb.groupby("subject"):
1273
+ for trialid, trialdf in subdf.groupby("trial_id"):
1274
+ trial.loc[((trial["subject"] == sub) & (trial["trial_id"] == trialid)), "number_of_words_in_trial"] = (
1275
+ trialdf["word"].count()
1276
+ )
1277
+ trial.sort_values(by="subject_trialID", inplace=True)
1278
+
1279
+ if "blink" in tmp.columns:
1280
+ blink = tmp.groupby("subject_trialID")["blink"].sum() / 2
1281
+ blink = blink.round().reset_index()
1282
+ trial = pd.merge(trial, blink, on="subject_trialID", how="left")
1283
+
1284
+ trial["nfix"] = tmp.groupby("subject_trialID")["fixation_number"].agg("count").values
1285
+ new_col_dfs = []
1286
+ new_col_dfs.append(tmp.groupby("subject_trialID")["duration"].agg("mean").reset_index(name="mean_fix_duration"))
1287
+
1288
+ new_col_dfs.append(tmp.groupby("subject_trialID")["duration"].agg("sum").reset_index(name="total_fix_duration"))
1289
+ for algo_choice in algo_choices:
1290
+ new_col_dfs.append(
1291
+ tmp.groupby("subject_trialID")[f"word_runid_{algo_choice}"]
1292
+ .agg("max")
1293
+ .reset_index(name=f"nrun_{algo_choice}")
1294
+ )
1295
+ tmp[f"saccade_length_{algo_choice}"] = tmp[f"word_land_{algo_choice}"] + tmp[f"word_launch_{algo_choice}"]
1296
+ new_col_dfs.append(
1297
+ tmp[(tmp[f"saccade_length_{algo_choice}"] >= 0) & tmp[f"saccade_length_{algo_choice}"].notna()]
1298
+ .groupby("subject_trialID")[f"saccade_length_{algo_choice}"]
1299
+ .agg("mean")
1300
+ .reset_index(name=f"saccade_length_{algo_choice}")
1301
+ )
1302
+
1303
+ word = wordcomb.copy()
1304
+ if f"firstrun_skip_{algo_choice}" in wordcomb.columns:
1305
+ new_col_dfs.append(
1306
+ word.groupby("subject_trialID")[f"firstrun_skip_{algo_choice}"]
1307
+ .agg("mean")
1308
+ .reset_index(name=f"skip_{algo_choice}")
1309
+ )
1310
+ if f"refix_{algo_choice}" in wordcomb.columns:
1311
+ new_col_dfs.append(
1312
+ word.groupby("subject_trialID")[f"refix_{algo_choice}"]
1313
+ .agg("mean")
1314
+ .reset_index(name=f"refix_{algo_choice}")
1315
+ )
1316
+ if f"reg_in_{algo_choice}" in wordcomb.columns:
1317
+ new_col_dfs.append(
1318
+ word.groupby("subject_trialID")[f"reg_in_{algo_choice}"]
1319
+ .agg("mean")
1320
+ .reset_index(name=f"reg_{algo_choice}")
1321
+ )
1322
+
1323
+ if f"firstrun_dur_{algo_choice}" in wordcomb.columns:
1324
+ new_col_dfs.append(
1325
+ word.groupby("subject_trialID")[f"firstrun_dur_{algo_choice}"]
1326
+ .agg("sum")
1327
+ .reset_index(name=f"firstpass_{algo_choice}")
1328
+ )
1329
+
1330
+ if f"total_fixation_duration_{algo_choice}" in wordcomb.columns:
1331
+ new_col_dfs.append(
1332
+ (word[f"total_fixation_duration_{algo_choice}"] - word[f"firstrun_dur_{algo_choice}"])
1333
+ .groupby(word["subject_trialID"])
1334
+ .agg("sum")
1335
+ .reset_index(name=f"rereading_{algo_choice}")
1336
+ )
1337
+ trial = pd.concat(
1338
+ [trial.set_index("subject_trialID")] + [df.set_index("subject_trialID") for df in new_col_dfs], axis=1
1339
+ ).reset_index()
1340
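+ # Reading rate in words per minute: total fixation duration is in ms, so
+ # 60000 ms per minute is divided by the mean fixation time per word.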
+ trial[f"reading_rate_{algo_choice}"] = (
1341
+ 60000 / (trial["total_fix_duration"] / trial["number_of_words_in_trial"])
1342
+ ).round()
1343
+
1344
+ return trial.copy()
1345
+
1346
+
1347
+ def aggregate_subjects(trials, algo_choices):
1348
+ trial_aggregates = trials.groupby("subject")[["nfix", "blink"]].mean().round(3).reset_index()
1349
+ trial_aggregates = trial_aggregates.merge(
1350
+ trials.groupby("subject")["question_correct"].sum().reset_index(name="n_question_correct"), on="subject"
1351
+ )
1352
+ trial_aggregates = trial_aggregates.merge(
1353
+ trials.groupby("subject")["trial_id"].count().reset_index(name="ntrial"), on="subject"
1354
+ )
1355
+ for algo_choice in algo_choices:
1356
+ cols_to_do = [
1357
+ c
1358
+ for c in [
1359
+ f"saccade_length_{algo_choice}",
1360
+ f"reg_{algo_choice}",
1361
+ f"mean_fix_duration_{algo_choice}",
1362
+ f"total_fix_duration_{algo_choice}",
1363
+ f"reading_rate_{algo_choice}",
1364
+ f"refix_{algo_choice}",
1365
+ f"nrun_{algo_choice}",
1366
+ f"skip_{algo_choice}",
1367
+ ]
1368
+ if c in trials.columns
1369
+ ]
1370
+ trial_aggregates_temp = trials.groupby("subject")[cols_to_do].mean().round(3).reset_index()
1371
+ trial_aggregates = pd.merge(trial_aggregates, trial_aggregates_temp, how="left", on="subject")
1372
+
1373
+ return trial_aggregates
process_asc_files_in_multi_p.py ADDED
@@ -0,0 +1,149 @@
1
+ from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
2
+ import json
3
+ from sys import platform as _platform
4
+ from functools import partial
5
+ import multiprocessing
6
+ import os
7
+ from tqdm.auto import tqdm
8
+ from multi_proc_funcs import DIST_MODELS_FOLDER, process_trial_choice, set_up_models
9
+ import sys
10
+ import pandas as pd
11
+
12
+
13
+ def get_cpu_count():
14
+ if os.sys.platform in ("linux", "linux2", "darwin"):
15
+ return os.cpu_count()
16
+ elif os.sys.platform == "win32":
17
+ return multiprocessing.cpu_count()
18
+ else:
19
+ return 1
20
+
21
+
22
+ def process_asc_files_in_multi_proc(
23
+ algo_choice,
24
+ choice_handle_short_and_close_fix,
25
+ discard_fixations_without_sfix,
26
+ discard_far_out_of_text_fix,
27
+ x_thres_in_chars,
28
+ y_thresh_in_heights,
29
+ short_fix_threshold,
30
+ merge_distance_threshold,
31
+ discard_long_fix,
32
+ discard_long_fix_threshold,
33
+ discard_blinks,
34
+ measures_to_calculate_multi_asc,
35
+ include_coords_multi_asc,
36
+ sent_measures_to_calculate_multi_asc,
37
+ trials_by_ids,
38
+ classic_algos_cfg,
39
+ models_dict,
40
+ fix_cols_to_add_multi_asc,
41
+ ):
42
+ funcc = partial(
43
+ process_trial_choice,
44
+ algo_choice=algo_choice,
45
+ choice_handle_short_and_close_fix=choice_handle_short_and_close_fix,
46
+ for_multi=True,
47
+ discard_fixations_without_sfix=discard_fixations_without_sfix,
48
+ discard_far_out_of_text_fix=discard_far_out_of_text_fix,
49
+ x_thres_in_chars=x_thres_in_chars,
50
+ y_thresh_in_heights=y_thresh_in_heights,
51
+ short_fix_threshold=short_fix_threshold,
52
+ merge_distance_threshold=merge_distance_threshold,
53
+ discard_long_fix=discard_long_fix,
54
+ discard_long_fix_threshold=discard_long_fix_threshold,
55
+ discard_blinks=discard_blinks,
56
+ measures_to_calculate_multi_asc=measures_to_calculate_multi_asc,
57
+ include_coords_multi_asc=include_coords_multi_asc,
58
+ sent_measures_to_calculate_multi_asc=sent_measures_to_calculate_multi_asc,
59
+ classic_algos_cfg=classic_algos_cfg,
60
+ models_dict=models_dict,
61
+ fix_cols_to_add=fix_cols_to_add_multi_asc,
62
+ )
63
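+ # Use one worker per trial, but never more than 32, and leave one CPU core free.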
+ workers = min(len(trials_by_ids), 32, get_cpu_count() - 1)
64
+ with multiprocessing.Pool(workers) as pool:
65
+ out = pool.map(funcc, trials_by_ids.values())
66
+ return out
67
+
68
+
69
+ def make_json_compatible(obj):
70
+ if isinstance(obj, dict):
71
+ return {k: make_json_compatible(v) for k, v in obj.items()}
72
+ elif isinstance(obj, list):
73
+ return [make_json_compatible(v) for v in obj]
74
+ elif isinstance(obj, pd.DataFrame):
75
+ return obj.to_dict(orient="records")
76
+ elif isinstance(obj, pd.Series):
77
+ return obj.to_dict()
78
+ else:
79
+ return obj
80
+
81
+
82
+ def main():
83
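+ # Reads a JSON-encoded argument list from stdin and writes the JSON-encoded
+ # results (or an {"error": ...} object) to stdout, so the script can be driven
+ # as a subprocess by a parent process.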
+ try:
84
+ input_data = sys.stdin.buffer.read()
85
+
86
+ (
87
+ algo_choice,
88
+ choice_handle_short_and_close_fix,
89
+ discard_fixations_without_sfix,
90
+ discard_far_out_of_text_fix,
91
+ x_thres_in_chars,
92
+ y_thresh_in_heights,
93
+ short_fix_threshold,
94
+ merge_distance_threshold,
95
+ discard_long_fix,
96
+ discard_long_fix_threshold,
97
+ discard_blinks,
98
+ measures_to_calculate_multi_asc,
99
+ include_coords_multi_asc,
100
+ sent_measures_to_calculate_multi_asc,
101
+ trials_by_ids,
102
+ classic_algos_cfg,
103
+ models_dict,
104
+ fix_cols_to_add_multi_asc,
105
+ ) = json.loads(input_data)
106
+ if (
107
+ "DIST" in algo_choice
108
+ or "Wisdom_of_Crowds_with_DIST" in algo_choice
109
+ or "DIST-Ensemble" in algo_choice
110
+ or "Wisdom_of_Crowds_with_DIST_Ensemble" in algo_choice
111
+ ):
112
+ del models_dict # Needed to stop pickling from failing for multiproc
113
+ models_dict = set_up_models(DIST_MODELS_FOLDER)
114
+ else:
115
+ models_dict = {}
116
+ out = process_asc_files_in_multi_proc(
117
+ algo_choice,
118
+ choice_handle_short_and_close_fix,
119
+ discard_fixations_without_sfix,
120
+ discard_far_out_of_text_fix,
121
+ x_thres_in_chars,
122
+ y_thresh_in_heights,
123
+ short_fix_threshold,
124
+ merge_distance_threshold,
125
+ discard_long_fix,
126
+ discard_long_fix_threshold,
127
+ discard_blinks,
128
+ measures_to_calculate_multi_asc,
129
+ include_coords_multi_asc,
130
+ sent_measures_to_calculate_multi_asc,
131
+ trials_by_ids,
132
+ classic_algos_cfg,
133
+ models_dict,
134
+ fix_cols_to_add_multi_asc,
135
+ )
136
+ out2 = []
137
+ for dffix, trial in out:
138
+ dffix = dffix.to_dict("records")
139
+ trial = make_json_compatible(trial)
140
+ out2.append((dffix, trial))
141
+ json_data_out = json.dumps(out2)
142
+ sys.stdout.flush()
143
+ print(json_data_out)
144
+ except Exception as e:
145
+ print(json.dumps({"error": str(e)}))
146
+
147
+
148
+ if __name__ == "__main__":
149
+ main()
requirements.txt ADDED
@@ -0,0 +1,25 @@
1
+ datasets
2
+ einops
3
+ matplotlib
4
+ numpy
5
+ pandas
6
+ PyYAML
7
+ seaborn
8
+ tqdm
9
+ transformers==4.*
10
+ tensorboard
11
+ torchmetrics
12
+ pytorch-lightning
13
+ scikit-learn
14
+ plotly
15
+ lovely-tensors
16
+ timm
17
+ openpyxl
18
+ torch==2.*
19
+ pydantic==1.10
20
+ streamlit >= 1.35
21
+ pycairo
22
+ eyekit
23
+ stqdm
24
+ jellyfish
25
+ icecream
saccades_df_columns.md ADDED
@@ -0,0 +1,38 @@
1
+ #### Column names for Saccades Dataframe
2
+ Some features were adapted from the popEye R package ([github](https://github.com/sascha2schroeder/popEye))
3
+ If a column depends on a line assignment, a _ALGORITHM_NAME suffix is appended to its name.
4
+ - subject: Subject name or ID (derived from filename)
5
+ - trial_id: Trial ID
6
+ - item: Item ID
7
+ - condition: Condition (if applicable)
8
+ - num: Saccade number
9
+ - start_time: Start time (in ms since start of the trial)
10
+ - end_time: End time (in ms since start of the trial)
11
+ - xs: Raw x start position (in pixel)
12
+ - ys: Raw y start position (in pixel)
13
+ - xe: Raw x end position (in pixel)
14
+ - ye: Raw y end position (in pixel)
15
+ - ampl: saccadic amplitude (degrees)
16
+ - pv: peak velocity (degrees/sec)
17
+ - start_uncorrected: Start time (in ms as recorded by EyeLink)
18
+ - stop_uncorrected: End time (in ms as recorded by EyeLink)
19
+ - blink_before: Whether a blink occurred directly before the saccade
20
+ - blink_after: Whether a blink occurred directly after the saccade
21
+ - blink: Whether a blink occurred directly before or after the saccade
22
+ - duration: Duration (in ms)
23
+ - xe_minus_xs: Horizontal saccade distance
24
+ - ye_minus_ys: Vertical saccade distance
25
+ - eucledian_distance: Euclidean distance
26
+ - angle: Angle
27
+ - dX: Horizontal saccade amplitude
28
+ - dY: Vertical saccade amplitude
29
+ - ys_ALGORITHM_NAME: Corrected y start position (in pixel), i.e. after line assignment
30
+ - ye_ALGORITHM_NAME: Corrected y end position (in pixel), i.e. after line assignment
31
+ - ye_minus_ys_ALGORITHM_NAME: Vertical saccade distance after being snapped to line
32
+ - angle_ALGORITHM_NAME: Angle after being snapped to line
33
+ - lines_ALGORITHM_NAME: Starting line of saccade
34
+ - linee_ALGORITHM_NAME: Landing line of saccade
35
+ - line_word_s_ALGORITHM_NAME: Number of the word on the line from which the saccade starts
36
+ - line_word_e_ALGORITHM_NAME: Number of the word on the line where the saccade ends
37
+ - lets_ALGORITHM_NAME: Number of the letter from which the saccade starts
38
+ - lete_ALGORITHM_NAME: Number of the letter where the saccade ends
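+
+ A small consistency sketch for the geometric columns (this is an assumption about how the columns relate, not taken from the package code; the angle convention in particular may differ):
+
+ ```python
+ import numpy as np
+ import pandas as pd
+
+ # A toy saccade built from the raw start/end positions described above
+ sacc = pd.DataFrame({"xs": [100.0], "ys": [200.0], "xe": [160.0], "ye": [210.0]})
+ dx = sacc["xe"] - sacc["xs"]    # should match xe_minus_xs
+ dy = sacc["ye"] - sacc["ys"]    # should match ye_minus_ys
+ dist = np.sqrt(dx**2 + dy**2)   # should match eucledian_distance
+ print(dx.iloc[0], dy.iloc[0], dist.iloc[0])
+ ```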
sentence_measures.md ADDED
@@ -0,0 +1,35 @@
1
+ #### Column names for Sentence measures
2
+ Some features were adapted from the popEye R package ([github](https://github.com/sascha2schroeder/popEye))
3
+ If a column depends on a line assignment, a _ALGORITHM_NAME suffix is appended to its name.
4
+ - subject: Participant ID
5
+ - trial_id: Position of trial in analysis
6
+ - item: Item ID
7
+ - condition: Condition (if applicable)
8
+ - sentence_number: Number of sentence in trial
9
+ - sentence: Sentence Text
10
+ - number_of_words: Number of words in sentence
11
+ - skip: Whether the sentence has been skipped
12
+ - nrun: Number of times the sentence has been read
13
+ - reread: Whether the sentence has been read more than one time
14
+ - reg_in: Whether a regression has been made into the sentence
15
+ - reg_out: Whether a regression has been made out of the sentence
16
+ - total_n_fixations: Number of fixations made on the sentence
17
+ - total_dur: Total sentence reading time
18
+ - rate: Reading rate (number of words per minute)
19
+ - gopast: Sum of all fixation durations from the time the sentence was entered until it was left to the right (regression path duration)
20
+ - gopast_sel: Sum of the durations of all fixations on the sentence itself from the time it was entered until it was left to the right (selective go-past time: the regression path duration minus the time spent on fixations outside the sentence during that path)
21
+ - firstrun_skip: Whether sentence has been skipped during first-pass reading
22
+ - firstrun_reg_in: Whether a regression has been made into the sentence during first-pass reading
23
+ - firstrun_reg_out: Whether a regression has been made out of the sentence during first-pass reading
24
+ - firstpass_n_fixations: Number of fixation made during first-pass reading
25
+ - firstpass_dur: First-pass reading time
26
+ - firstpass_forward_n_fixations: Number of first-pass forward fixations (landing on one of the upcoming words of a sentence)
27
+ - firstpass_forward_dur: Duration of forward fixations during first-pass reading
28
+ - firstpass_reread_n_fixations: Number of first-pass rereading fixations (landing one of the words of the sentence that have been read previously)
29
+ - firstpass_reread_dur: Duration of rereading fixations during first-pass reading
30
+ - lookback_n_fixations: Number of fixations made on the sentence after regressing into it from another sentence
31
+ - lookback_dur: Duration of lookback fixations on the sentence
32
+ - lookfrom_n_fixations: Number of rereading fixations on another sentence initiated from the sentence
33
+ - lookfrom_dur: Duration of lookfrom fixations on the sentence
34
+
35
+ The forward, rereading, look-back, and look-from measures are computed in a similar way to the SR "Getting Reading Measures" tool (https://www.sr-support.com/thread-350.html), which is based on the Eyelink Analysojia software (developed by the Turku Eye Labs).
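+
+ A minimal usage sketch (the folder, subject, trial, and algorithm parts of the file name are placeholders; the CSV is written as SUBJECT_TRIALID_ALGORITHM_NAME_sentence_measures.csv to the results folder when per-trial saving is enabled, and total_dur/rate are only present if they were among the selected measures):
+
+ ```python
+ import pandas as pd
+
+ sent = pd.read_csv("results/SUBJECT_TRIALID_ALGORITHM_NAME_sentence_measures.csv")
+ print(sent[["sentence_number", "number_of_words",
+             "total_dur_ALGORITHM_NAME", "rate_ALGORITHM_NAME"]].head())
+ ```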
subject_measures.md ADDED
@@ -0,0 +1,15 @@
1
+ #### Column names for Subject level summary statistics
2
+ Some features were adapted from the popEye R package ([github](https://github.com/sascha2schroeder/popEye))
3
+ If a column depends on a line assignment, a _ALGORITHM_NAME suffix is appended to its name.
4
+
5
+ - subject: Subject identifier, taken from filename
6
+ - ntrial: Number of trials for the subject
7
+ - n_question_correct: Total number of correctly answered questions
8
+ - blink: Mean number of blinks across trials
9
+ - nfix: Mean number of fixations across trials
10
+ - skip_ALGORITHM_NAME: Mean proportion of words that have been skipped during first-pass reading across trials
11
+ - saccade_length_ALGORITHM_NAME: Mean (forward) saccade length
12
+ - refix_ALGORITHM_NAME: Mean proportion of words that have been refixated across trials
13
+ - reg_ALGORITHM_NAME: Mean proportion of words which have been regressed into across trials
14
+ - mean_fixation_duration: Mean fixation duration
15
+ - total_fix_duration: Mean total reading time across trials
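+
+ A minimal sketch of how these summaries are produced from the trial-level output (aggregate_trials and aggregate_subjects are defined in this repository; the algorithm name is a placeholder):
+
+ ```python
+ # trials_df: the trial-level dataframe returned by aggregate_trials
+ subject_df = aggregate_subjects(trials_df, ["ALGORITHM_NAME"])
+ print(subject_df[["subject", "ntrial", "n_question_correct", "nfix", "blink"]])
+ ```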
trials_df_columns.md ADDED
@@ -0,0 +1,36 @@
1
+ #### Column names for Trials Dataframe
2
+ Some features were adapted from the popEye R package ([github](https://github.com/sascha2schroeder/popEye))
3
+ If a column depends on a line assignment, a _ALGORITHM_NAME suffix is appended to its name.
4
+ - subject: Subject name or ID (derived from filename)
5
+ - trial_id: Trial ID
6
+ - item: Item ID
7
+ - condition: Condition (if applicable)
8
+ - average_y_correction_ALGORITHM_NAME: Average difference between raw y position of a fixation and the center of the line to which it was assigned in pixels
9
+ - Number of fixations before cleaning: Number of fixations found for the trial before any cleaning is done
10
+ - Discard long fixations: Indicates if overly long fixations were discarded
11
+ - Number of discarded long fixations: Number of fixations that were discarded due to being overly long
12
+ - Number of discarded long fixations (%): Number of fixations that were discarded due to being overly long as a percentage of the total number of fixations
13
+ - How short and close fixations were handled: Which option was chosen for handling short fixations
14
+ - Number of merged fixations: Number of fixations that were merged because their duration was below the set threshold and they were in horizontal proximity to the preceding or subsequent fixation
15
+ - Number of merged fixations (%): Number of fixations that were merged because their duration was below the set threshold and they were in horizontal proximity to the preceding or subsequent fixation, as a percentage of the total number of fixations
16
+ - Far out of text fixations were discarded: Whether fixations were discarded if they were far outside the stimulus text
17
+ - Number of discarded far-out-of-text fixations: Number of fixations that were discarded due to being far outside the stimulus text
18
+ - Number of discarded far-out-of-text fixations (%): Number of fixations that were discarded due to being far outside the stimulus text as a percentage of the total number of fixations
19
+ - Total number of discarded and merged fixations: Number of fixations that were cleaned up
20
+ - Total number of discarded and merged fixations (%): Number of fixations that were cleaned up as a percentage of the total number of fixations
21
+ - trial_start_time: Timestamp of the start of the trial
22
+ - trial_end_time: Timestamp of the end of the trial
23
+ - question_correct: Whether the question associated with the trial was answered correctly. This will be blank if it could not be determined
24
+ - number_of_words_in_trial: Total number of words in the stimulus used for the trial
25
+ - blink: Number of blinks detected during the trial
26
+ - nfix: Number of fixations remaining after cleaning
27
+ - nrun_ALGORITHM_NAME: Number of runs on trial
28
+ - saccade_length_ALGORITHM_NAME: Average saccade length across the trial
29
+ - mean_fix_duration_ALGORITHM_NAME: Average fixation duration across the trial
30
+ - total_fix_duration_ALGORITHM_NAME: Total fixation duration across the trial
31
+ - skip_ALGORITHM_NAME: Proportion of words in the trial that have been skipped during first-pass reading
32
+ - refix_ALGORITHM_NAME: Proportion of words in the trial that have been refixated
33
+ - reg_ALGORITHM_NAME: Proportion of words which have been regressed into
34
+ - firstpass_ALGORITHM_NAME: First-pass reading time
35
+ - rereading_ALGORITHM_NAME: Re-reading time (total reading time minus first-pass reading time)
36
+ - reading_rate_ALGORITHM_NAME: Reading rate (words per minute)
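+
+ For example (made-up numbers for illustration), a trial containing 250 words with a total fixation duration of 60000 ms yields a reading rate of 60000 / (60000 / 250) = 250 words per minute.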
utils.py ADDED
@@ -0,0 +1,1349 @@
1
+ import pickle
2
+ from io import StringIO
3
+ import re
4
+ import zipfile
5
+ import os
6
+ import plotly.graph_objects as go
7
+ from io import StringIO
8
+ import numpy as np
9
+ import pandas as pd
10
+ from PIL import Image
11
+ import json
12
+ from matplotlib import pyplot as plt
13
+ import pathlib as pl
14
+ import matplotlib as mpl
15
+ from streamlit.runtime.uploaded_file_manager import UploadedFile
16
+ from tqdm.auto import tqdm
17
+ import time
18
+ import requests
19
+ from icecream import ic
20
+ from matplotlib import font_manager
21
+ from multi_proc_funcs import (
22
+ COLORS,
23
+ PLOTS_FOLDER,
24
+ RESULTS_FOLDER,
25
+ add_boxes_to_ax,
26
+ add_text_to_ax,
27
+ matplotlib_plot_df,
28
+ save_trial_to_json,
29
+ sigmoid,
30
+ )
31
+ import emreading_funcs as emf
32
+
33
+ ic.configureOutput(includeContext=True)
34
+ TEMP_FIGURE_STIMULUS_PATH = PLOTS_FOLDER / "temp_matplotlib_plot_stimulus.png"
35
+ all_fonts = [x.name for x in font_manager.fontManager.ttflist]
36
+ mpl.use("agg")
37
+
38
+ DIST_MODELS_FOLDER = pl.Path("models")
39
+ IMAGENET_MEAN = [0.485, 0.456, 0.406]
40
+ IMAGENET_STD = [0.229, 0.224, 0.225]
41
+ PLOTS_FOLDER = pl.Path("plots")
42
+
43
+ names_dict = {
44
+ "SSACC": {"Descr": "Start of Saccade", "Pattern": "SSACC <eye > <stime>"},
45
+ "ESACC": {
46
+ "Descr": "End of Saccade",
47
+ "Pattern": "ESACC <eye > <stime> <etime > <dur> <sxp > <syp> <exp > <eyp> <ampl > <pv >",
48
+ },
49
+ "SFIX": {"Descr": "Start of Fixation", "Pattern": "SFIX <eye > <stime>"},
50
+ "EFIX": {"Descr": "End of Fixation", "Pattern": "EFIX <eye > <stime> <etime > <dur> <axp > <ayp> <aps >"},
51
+ "SBLINK": {"Descr": "Start of Blink", "Pattern": "SBLINK <eye > <stime>"},
52
+ "EBLINK": {"Descr": "End of Blink", "Pattern": "EBLINK <eye > <stime> <etime > <dur>"},
53
+ "DISPLAY ON": {"Descr": "Actual start of Trial", "Pattern": "DISPLAY ON"},
54
+ }
55
+ metadata_strs = ["DISPLAY COORDS", "GAZE_COORDS", "FRAMERATE"]
56
+
57
+
58
+ POPEYE_FIXATION_COLS_DICT = {
59
+ "start": "start_time",
60
+ "stop": "end_time",
61
+ "xs": "x",
62
+ "ys": "y",
63
+ }
64
+ EMREADING_COLS_DROPLIST = ["hasText", "char_trial"]
65
+ EMREADING_COLS_DICT = {
66
+ "sub": "subject",
67
+ "item": "item",
68
+ "condition": "condition",
69
+ "SFIX": "start_time",
70
+ "EFIX": "end_time",
71
+ "xPos": "x",
72
+ "yPos": "y",
73
+ "fix_number": "fixation_number",
74
+ "fix_dur": "duration",
75
+ "wordID": "on_word_EM",
76
+ "outOfBnds": "out_of_bounds",
77
+ "outsideText": "out_of_text_area",
78
+ }
79
+
80
+
81
+ def download_url(url, target_filename):
82
+ max_retries = 4
83
+ for attempt in range(1, max_retries + 1):
84
+ try:
85
+ r = requests.get(url)
86
+ if r.status_code != 200:
87
+ ic(f"Download failed due to unsuccessful response from server: {r.status_code}")
88
+ return -1
89
+ open(target_filename, "wb").write(r.content)
90
+ return 0
91
+
92
+ except Exception as e:
93
+ if attempt < max_retries:
94
+ time.sleep(2 * attempt)
95
+ ic(f"Download failed due to an error; will try again in {attempt*2} seconds:", e)
96
+ else:
97
+ ic(f"Failed after all attempts ({url}). Error details:\n{e}")
98
+ return -1
99
+
100
+
101
+ def asc_to_trial_ids(
102
+ asc_file, close_gap_between_words, paragraph_trials_only, ias_files, trial_start_keyword, end_trial_at_keyword
103
+ ):
104
+ asc_encoding = ["ISO-8859-15", "UTF-8"][0]
105
+ trials_dict, lines = file_to_trials_and_lines(
106
+ asc_file,
107
+ asc_encoding,
108
+ close_gap_between_words=close_gap_between_words,
109
+ paragraph_trials_only=paragraph_trials_only,
110
+ uploaded_ias_files=ias_files,
111
+ trial_start_keyword=trial_start_keyword,
112
+ end_trial_at_keyword=end_trial_at_keyword,
113
+ )
114
+
115
+ enum = (
116
+ trials_dict["paragraph_trials"]
117
+ if paragraph_trials_only and "paragraph_trials" in trials_dict.keys()
118
+ else range(trials_dict["max_trial_idx"])
119
+ )
120
+ trials_by_ids = {trials_dict[idx]["trial_id"]: trials_dict[idx] for idx in enum}
121
+ return trials_by_ids, lines, trials_dict
122
+
123
+
124
+ def get_trials_list(
125
+ asc_file, close_gap_between_words, paragraph_trials_only, ias_files, trial_start_keyword, end_trial_at_keyword
126
+ ):
127
+ if hasattr(asc_file, "name"):
128
+ savename = pl.Path(asc_file.name).stem
129
+ else:
130
+ savename = pl.Path(asc_file).stem
131
+
132
+ trials_by_ids, lines, trials_dict = asc_to_trial_ids(
133
+ asc_file,
134
+ close_gap_between_words=close_gap_between_words,
135
+ paragraph_trials_only=paragraph_trials_only,
136
+ ias_files=ias_files,
137
+ trial_start_keyword=trial_start_keyword,
138
+ end_trial_at_keyword=end_trial_at_keyword,
139
+ )
140
+ trial_keys = list(trials_by_ids.keys())
141
+ savename = RESULTS_FOLDER / f"{savename}_metadata_overview.json"
142
+
143
+ offload_list = [
144
+ "gaze_df",
145
+ "dffix",
146
+ "chars_df",
147
+ "saccade_df",
148
+ "x_char_unique",
149
+ "line_heights",
150
+ "chars_list",
151
+ "words_list",
152
+ "dffix_sacdf_popEye",
153
+ "fixdf_popEye",
154
+ "saccade_df",
155
+ "sacdf_popEye",
156
+ "combined_df",
157
+ "events_df",
158
+ ]
159
+ trials_dict_cut_down = {}
160
+ for k_outer, v_outer in trials_dict.items():
161
+ if isinstance(v_outer, dict):
162
+ trials_dict_cut_down[k_outer] = {}
163
+ for prop, val in v_outer.items():
164
+ if prop not in offload_list:
165
+ trials_dict_cut_down[k_outer][prop] = val
166
+ else:
167
+ trials_dict_cut_down[k_outer] = v_outer
168
+ save_trial_to_json(trials_dict_cut_down, savename=savename)
169
+ return trial_keys, trials_by_ids, lines, asc_file, trials_dict
170
+
171
+
172
+ def calc_xdiff_ydiff(line_xcoords_no_pad, line_ycoords_no_pad, line_heights, allow_multiple_values=False):
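+ # Estimate the horizontal spacing between character centres (x_diff) and the vertical spacing between text lines (y_diff); if the stimulus has a single text line, the line height is used as y_diff.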
173
+ x_diffs = np.unique(np.diff(line_xcoords_no_pad))
174
+ if len(x_diffs) == 1:
175
+ x_diff = x_diffs[0]
176
+ elif not allow_multiple_values:
177
+ x_diff = np.min(x_diffs)
178
+ else:
179
+ x_diff = x_diffs
180
+
181
+ if np.unique(line_ycoords_no_pad).shape[0] == 1:
182
+ return x_diff, line_heights[0]
183
+ y_diffs = np.unique(np.diff(line_ycoords_no_pad))
184
+ if len(y_diffs) == 1:
185
+ y_diff = y_diffs[0]
186
+ elif len(y_diffs) == 0:
187
+ y_diff = 0
188
+ elif not allow_multiple_values:
189
+ y_diff = np.min(y_diffs)
190
+ else:
191
+ y_diff = y_diffs
192
+ return np.round(x_diff, decimals=2), np.round(y_diff, decimals=2)
193
+
194
+
195
+ def add_words(chars_list):
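+ # Reconstruct word and sentence records from the per-character boxes: a word ends at a space, at a line change, or at the final character; a sentence ends at '.', '!' or '?'. Each character is annotated with the word and sentence it belongs to.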
196
+ chars_list_reconstructed = []
197
+ words_list = []
198
+ sentence_list = []
199
+ sentence_start_idx = 0
200
+ sentence_num = 0
201
+ word_start_idx = 0
202
+ chars_df = pd.DataFrame(chars_list)
203
+ chars_df["char_width"] = chars_df.char_xmax - chars_df.char_xmin
204
+ word_dict = None
205
+ on_line_num = -1
206
+ line_change_on_next_char = False
207
+ num_chars = len(chars_list)
208
+ for idx, char_dict in enumerate(chars_list):
209
+ # check if line change will happen after current char
210
+ on_line_num = char_dict["assigned_line"]
211
+ if idx < num_chars - 1:
212
+ line_change_on_next_char = on_line_num != chars_list[idx + 1]["assigned_line"]
213
+ else:
214
+ line_change_on_next_char = False
215
+ chars_list_reconstructed.append(char_dict)
216
+ if char_dict["char"] in [" "] or len(chars_list_reconstructed) == len(chars_list) or line_change_on_next_char:
217
+ word_xmin = chars_list_reconstructed[word_start_idx]["char_xmin"]
218
+ if chars_list_reconstructed[-1]["char"] == " " and len(chars_list_reconstructed) != 1:
219
+ word_xmax = chars_list_reconstructed[-2]["char_xmax"]
220
+
221
+ word = "".join(
222
+ [
223
+ chars_list_reconstructed[idx]["char"]
224
+ for idx in range(word_start_idx, len(chars_list_reconstructed) - 1)
225
+ ]
226
+ )
227
+ elif len(chars_list_reconstructed) == 1:
228
+ word_xmax = chars_list_reconstructed[-1]["char_xmax"]
229
+ word = " "
230
+ else:
231
+ word = "".join(
232
+ [
233
+ chars_list_reconstructed[idx]["char"]
234
+ for idx in range(word_start_idx, len(chars_list_reconstructed))
235
+ ]
236
+ )
237
+ word_xmax = chars_list_reconstructed[-1]["char_xmax"]
238
+ word_ymin = chars_list_reconstructed[word_start_idx]["char_ymin"]
239
+ word_ymax = chars_list_reconstructed[word_start_idx]["char_ymax"]
240
+ word_x_center = round((word_xmax - word_xmin) / 2 + word_xmin, ndigits=2)
241
+ word_y_center = round((word_ymax - word_ymin) / 2 + word_ymin, ndigits=2)
242
+ word_length = len(word)
243
+ assigned_line = chars_list_reconstructed[word_start_idx]["assigned_line"]
244
+ word_dict = dict(
245
+ word_number=len(words_list),
246
+ word=word,
247
+ word_length=word_length,
248
+ word_xmin=word_xmin,
249
+ word_xmax=word_xmax,
250
+ word_ymin=word_ymin,
251
+ word_ymax=word_ymax,
252
+ word_x_center=word_x_center,
253
+ word_y_center=word_y_center,
254
+ assigned_line=assigned_line,
255
+ )
256
+ if len(word) > 0 and word != " ":
257
+ words_list.append(word_dict)
258
+ for cidx, char_dict in enumerate(chars_list_reconstructed[word_start_idx:]):
259
+ if char_dict["char"] == " ":
260
+ char_dict["in_word_number"] = len(words_list)
261
+ char_dict["in_word"] = " "
262
+ char_dict["num_letters_from_start_of_word"] = 0
263
+ else:
264
+ char_dict["in_word_number"] = len(words_list) - 1
265
+ char_dict["in_word"] = word
266
+ char_dict["num_letters_from_start_of_word"] = cidx
267
+
268
+ word_start_idx = idx + 1
269
+
270
+ if chars_list_reconstructed[-1]["char"] in [".", "!", "?"] or idx == (len(chars_list) - 1):
271
+ if idx != sentence_start_idx:
272
+ chars_df_temp = pd.DataFrame(chars_list_reconstructed[sentence_start_idx:])
273
+ line_texts = []
274
+ for sidx, subdf in chars_df_temp.groupby("assigned_line"):
275
+ line_text = "_".join(subdf.char.values)
276
+ line_text = line_text.replace("_ _", " ")
277
+ line_text = line_text.replace("_", "")
278
+ line_texts.append(line_text.strip())
279
+ sentence_text = " ".join(line_texts)
280
+ sentence_dict = dict(sentence_num=sentence_num, sentence_text=sentence_text)
281
+ sentence_list.append(sentence_dict)
282
+ for c in chars_list_reconstructed[sentence_start_idx:]:
283
+ c["in_sentence_number"] = sentence_num
284
+ c["in_sentence"] = sentence_text
285
+ sentence_start_idx = len(chars_list_reconstructed)
286
+ sentence_num += 1
287
+ else:
288
+ sentence_list[-1]["sentence_text"] += chars_list_reconstructed[sentence_start_idx]["char"]
289
+ chars_list_reconstructed[idx]["in_sentence_number"] = sentence_list[-1]["sentence_num"]
290
+ chars_list_reconstructed[idx]["in_sentence"] = sentence_list[-1]["sentence_text"]
291
+ for cidx, char_dict in enumerate(chars_list_reconstructed):
292
+ if (
293
+ char_dict["char"] == " "
294
+ and (cidx + 1) < len(chars_list_reconstructed)
295
+ and char_dict["assigned_line"] == chars_list_reconstructed[cidx + 1]["assigned_line"]
296
+ ):
297
+ char_dict["in_word_number"] = chars_list_reconstructed[cidx + 1]["in_word_number"]
298
+ char_dict["in_word"] = chars_list_reconstructed[cidx + 1]["in_word"]
299
+
300
+ last_letter_in_word = words_list[-1]["word"][-1]
301
+ last_letter_in_chars_list_reconstructed = char_dict["char"]
302
+ if last_letter_in_word != last_letter_in_chars_list_reconstructed:
303
+ if last_letter_in_chars_list_reconstructed in [".", "!", "?"]:
304
+ words_list[-1] = dict(
305
+ word_number=len(words_list),
306
+ word=words_list[-1]["word"] + char_dict["char"],
307
+ word_length=len(words_list[-1]["word"] + char_dict["char"]),
308
+ word_xmin=words_list[-1]["word_xmin"],
309
+ word_xmax=char_dict["char_xmax"],
310
+ word_ymin=words_list[-1]["word_ymin"],
311
+ word_ymax=words_list[-1]["word_ymax"],
312
+ assigned_line=assigned_line,
313
+ )
314
+
315
+ word_x_center = round(
316
+ (words_list[-1]["word_xmax"] - words_list[-1]["word_xmin"]) / 2 + words_list[-1]["word_xmin"], ndigits=2
317
+ )
318
+ word_y_center = round(
319
+ (words_list[-1]["word_ymax"] - word_dict["word_ymin"]) / 2 + words_list[-1]["word_ymin"], ndigits=2
320
+ )
321
+ words_list[-1]["word_x_center"] = word_x_center
322
+ words_list[-1]["word_y_center"] = word_y_center
323
+ else:
324
+ word_dict = dict(
325
+ word_number=len(words_list),
326
+ word=char_dict["char"],
327
+ word_length=1,
328
+ word_xmin=char_dict["char_xmin"],
329
+ word_xmax=char_dict["char_xmax"],
330
+ word_ymin=char_dict["char_ymin"],
331
+ word_ymax=char_dict["char_ymax"],
332
+ word_x_center=char_dict["char_x_center"],
333
+ word_y_center=char_dict["char_y_center"],
334
+ assigned_line=assigned_line,
335
+ )
336
+ words_list.append(word_dict)
337
+ chars_list_reconstructed[-1]["in_word_number"] = len(words_list) - 1
338
+ chars_list_reconstructed[-1]["in_word"] = word_dict["word"]
339
+ chars_list_reconstructed[-1]["num_letters_from_start_of_word"] = 0
340
+ if len(sentence_list) > 0:
341
+ chars_list_reconstructed[-1]["in_sentence_number"] = sentence_num - 1
342
+ chars_list_reconstructed[-1]["in_sentence"] = sentence_list[-1]["sentence_text"]
343
+ else:
344
+ ic(f"Warning Sentence list empty: {sentence_list}")
345
+
346
+ return words_list, chars_list_reconstructed
347
+
348
+
349
+ def read_ias_file(ias_file, prefix):
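+ # Read an EyeLink interest-area (.ias) file into a DataFrame of word boxes; if the areas include the trailing space, the right edge is pulled in by one character width, and centres plus line assignments are added.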
350
+
351
+ if isinstance(ias_file, UploadedFile):
352
+ lines = StringIO(ias_file.getvalue().decode("utf-8")).readlines()
353
+ ias_dicts = []
354
+ for l in lines:
355
+ lsplit = l.strip().split("\t")
356
+ ldict = {
357
+ f"{prefix}_number": float(lsplit[1]),
358
+ f"{prefix}_xmin": float(lsplit[2]),
359
+ f"{prefix}_xmax": float(lsplit[4]),
360
+ f"{prefix}_ymin": float(lsplit[3]),
361
+ f"{prefix}_ymax": float(lsplit[5]),
362
+ prefix: lsplit[6],
363
+ }
364
+ ias_dicts.append(ldict)
365
+ ias_df = pd.DataFrame(ias_dicts)
366
+ else:
367
+ ias_df = pd.read_csv(ias_file, delimiter="\t", header=None)
368
+ ias_df = ias_df.rename(
369
+ {
370
+ 1: f"{prefix}_number",
371
+ 2: f"{prefix}_xmin",
372
+ 4: f"{prefix}_xmax",
373
+ 3: f"{prefix}_ymin",
374
+ 5: f"{prefix}_ymax",
375
+ 6: prefix,
376
+ },
377
+ axis=1,
378
+ )
379
+ first_line_df = ias_df[ias_df[f"{prefix}_ymin"] == ias_df.loc[0, f"{prefix}_ymin"]]
380
+ words_include_spaces = (
381
+ first_line_df[f"{prefix}_xmax"].values == first_line_df[f"{prefix}_xmin"].shift(-1).values
382
+ ).any()
383
+ ias_df[f"{prefix}_width"] = ias_df[f"{prefix}_xmax"] - ias_df[f"{prefix}_xmin"]
384
+ if words_include_spaces:
385
+ ias_df[f"{prefix}_length"] = ias_df[prefix].map(lambda x: len(x) + 1)
386
+ ias_df[f"{prefix}_width_per_length"] = ias_df[f"{prefix}_width"] / ias_df[f"{prefix}_length"]
387
+ ias_df[f"{prefix}_xmax"] = (ias_df[f"{prefix}_xmax"] - ias_df[f"{prefix}_width_per_length"]).round(2)
388
+
389
+ ias_df[f"{prefix}_x_center"] = (
390
+ (ias_df[f"{prefix}_xmax"] - ias_df[f"{prefix}_xmin"]) / 2 + ias_df[f"{prefix}_xmin"]
391
+ ).round(2)
392
+ ias_df[f"{prefix}_y_center"] = (
393
+ (ias_df[f"{prefix}_ymax"] - ias_df[f"{prefix}_ymin"]) / 2 + ias_df[f"{prefix}_ymin"]
394
+ ).round(2)
395
+ unique_midlines = list(np.unique(ias_df[f"{prefix}_y_center"]))
396
+ assigned_lines = [unique_midlines.index(x) for x in ias_df[f"{prefix}_y_center"]]
397
+ ias_df["assigned_line"] = assigned_lines
398
+ ias_df[f"{prefix}_number"] = np.arange(ias_df.shape[0])
399
+ return ias_df
400
+
401
+
402
+ def get_chars_list_from_words_list(ias_df, prefix="word"):
403
+ ias_df.reset_index(inplace=True, drop=True)
404
+ unique_midlines = list(np.unique(ias_df[f"{prefix}_y_center"]))
405
+ chars_list = []
406
+ for (idx, row), (next_idx, next_row) in zip(ias_df.iterrows(), ias_df.shift(-1).iterrows()):
407
+ word = str(row[prefix])
408
+ letter_width = (row[f"{prefix}_xmax"] - row[f"{prefix}_xmin"]) / len(word)
409
+ for i_w, letter in enumerate(word):
410
+ char_dict = dict(
411
+ in_word_number=idx,
412
+ in_word=word,
413
+ char_xmin=round(row[f"{prefix}_xmin"] + i_w * letter_width, 2),
414
+ char_xmax=round(row[f"{prefix}_xmin"] + (i_w + 1) * letter_width, 2),
415
+ char_ymin=row[f"{prefix}_ymin"],
416
+ char_ymax=row[f"{prefix}_ymax"],
417
+ char=letter,
418
+ )
419
+
420
+ char_dict["char_x_center"] = round(
421
+ (char_dict["char_xmax"] - char_dict["char_xmin"]) / 2 + char_dict["char_xmin"], ndigits=2
422
+ )
423
+ char_dict["char_y_center"] = round(
424
+ (row[f"{prefix}_ymax"] - row[f"{prefix}_ymin"]) / 2 + row[f"{prefix}_ymin"], ndigits=2
425
+ )
426
+
427
+ if i_w >= len(word) + 1:
428
+ break
429
+ char_dict["assigned_line"] = unique_midlines.index(char_dict["char_y_center"])
430
+ chars_list.append(char_dict)
431
+ if chars_list[-1]["char"] != " " and row.assigned_line == next_row.assigned_line:
432
+ char_dict = dict(
433
+ char_xmin=chars_list[-1]["char_xmax"],
434
+ char_xmax=round(chars_list[-1]["char_xmax"] + letter_width, 2),
435
+ char_ymin=row[f"{prefix}_ymin"],
436
+ char_ymax=row[f"{prefix}_ymax"],
437
+ char=" ",
438
+ )
439
+
440
+ char_dict["char_x_center"] = round(
441
+ (char_dict["char_xmax"] - char_dict["char_xmin"]) / 2 + char_dict["char_xmin"], ndigits=2
442
+ )
443
+ char_dict["char_y_center"] = round(
444
+ (row[f"{prefix}_ymax"] - row[f"{prefix}_ymin"]) / 2 + row[f"{prefix}_ymin"], ndigits=2
445
+ )
446
+
447
+ char_dict["assigned_line"] = unique_midlines.index(char_dict["char_y_center"])
448
+ chars_list.append(char_dict)
449
+ chars_df = pd.DataFrame(chars_list)
450
+ chars_df.loc[:, ["in_word_number", "in_word"]] = chars_df.loc[:, ["in_word_number", "in_word"]].copy().ffill(axis=0)
451
+ return chars_df.to_dict("records")
452
+
453
+
454
+ def check_values(v1, v2):
455
+ """Function that compares two lists for equality.
456
+
457
+ Returns True if both lists are the same; False if they are not; and None if either is None."""
458
+
459
+ # Return None if either value is missing (None or NaN)
460
+ if v1 is None or v2 is None or pd.isna(v1) or pd.isna(v2):
461
+ return None
462
+
463
+ # Compare the two values
464
+ if v1 != v2:
465
+ return False
468
+ return True
469
+
470
+
471
+ def asc_lines_to_trials_by_trail_id(
472
+ lines: list,
473
+ paragraph_trials_only=True,
474
+ filename: str = "",
475
+ close_gap_between_words=True,
476
+ ias_files=[],
477
+ start_trial_at_keyword="START",
478
+ end_trial_at_keyword="END",
479
+ ) -> dict:
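+ # Walk the ASC message lines once to segment the recording into trials (TRIALID / TRIAL_RESULT), collecting per-trial metadata (item, condition, question responses, calibration), then walk each trial's lines again for stimulus character boxes and start/end timestamps.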
480
+
481
+ if len(ias_files) > 0:
482
+ ias_files_dict = {pl.Path(f.name).stem: f for f in ias_files}
483
+ else:
484
+ ias_files_dict = {}
485
+ if hasattr(filename, "name"):
486
+ filename = filename.name
487
+ subject = pl.Path(filename).stem
488
+ y_px = []
489
+ x_px = []
490
+ calibration_offset = []
491
+ calibration_max_error = []
492
+ calibration_time = []
493
+ calibration_avg_error = []
494
+ trial_var_block_lines = None
495
+ question_answer = None
496
+ question_correct = None
497
+ condition = "UNKNOWN"
498
+ item = "UNKNOWN"
499
+ depend = "UNKNOWN"
500
+ trial_index = None
501
+ fps = None
502
+ display_coords = None
503
+ trial_var_block_idx = -1
504
+ trials_dict = dict(paragraph_trials=[], paragraph_trial_IDs=[])
505
+ trial_idx = -1
506
+ trial_var_block_start_idx = -1
507
+ removed_trial_ids = []
508
+ ias_file = ""
509
+ trial_var_block_lines_list = []
510
+ if "\n".join(map(str.strip, lines)).find("TRIAL_VAR") != -1:
511
+ for idx, l in enumerate(tqdm(lines, desc=f"Checking for TRIAL_VAR lines for {filename}")):
512
+ if trial_var_block_start_idx == -1 and "MSG" not in l:
513
+ continue
514
+ if "TRIAL_VAR" in l:
515
+ if trial_var_block_start_idx == -1:
516
+ trial_var_block_start_idx = idx
517
+ continue
518
+ else:
519
+ if trial_var_block_start_idx != -1:
520
+ trial_var_block_stop_idx = idx
521
+ trial_var_block_lines = [
522
+ x.strip() for x in lines[trial_var_block_start_idx:trial_var_block_stop_idx]
523
+ ]
524
+ trial_var_block_lines_list.append(trial_var_block_lines)
525
+ trial_var_block_start_idx = -1
526
+ has_trial_var_lines = len(trial_var_block_lines_list) > 0
527
+ else:
528
+ has_trial_var_lines = False
529
+
530
+ for idx, l in enumerate(lines):
531
+ if "MSG" not in l:
532
+ continue
533
+ parts = l.strip().split(" ")
534
+ if "TRIALID" in l:
535
+ trial_id = re.split(r"[ :\t]+", l.strip())[-1]
536
+ trial_id_timestamp = parts[1]
537
+ trial_idx += 1
538
+ if trial_id[0] in ["F", "P", "E"]:
539
+
540
+ parse_dict = emf.parse_itemID(trial_id)
541
+ condition = parse_dict["condition"]
542
+ item = parse_dict["item"]
543
+ depend = parse_dict["depend"]
544
+ else:
545
+ parse_dict = {}
546
+ if trial_id[0] == "F":
547
+ trial_is = "question"
548
+ elif trial_id[0] == "P":
549
+ trial_is = "practice"
550
+ else:
551
+ if has_trial_var_lines:
552
+ trial_var_block_idx += 1
553
+ trial_var_block_lines = trial_var_block_lines_list[trial_var_block_idx]
554
+ image_lines = [s for s in trial_var_block_lines if "img" in s]
555
+ if len(image_lines) > 0:
556
+ item = image_lines[0].split(" ")[-1]
557
+ cond_lines = [s for s in trial_var_block_lines if "cond" in s]
558
+ if len(cond_lines) > 0:
559
+ condition = cond_lines[0].split(" ")[-1]
560
+ item_lines = [s for s in trial_var_block_lines if "item" in s]
561
+ if len(item_lines) > 0:
562
+ item = item_lines[0].split(" ")[-1]
563
+ trial_index_lines = [s for s in trial_var_block_lines if "Trial_Index" in s]
564
+ if len(trial_index_lines) > 0:
565
+ trial_index = trial_index_lines[0].split(" ")[-1]
566
+ question_key_lines = [s for s in trial_var_block_lines if "QUESTION_KEY_PRESSED" in s]
567
+ if len(question_key_lines) > 0:
568
+ question_answer = question_key_lines[0].split(" ")[-1]
569
+ question_response_lines = [s for s in trial_var_block_lines if " RESPONSE" in s]
570
+ if len(question_response_lines) > 0:
571
+ question_answer = question_response_lines[0].split(" ")[-1]
572
+ question_correct_lines = [
573
+ s for s in trial_var_block_lines if ("QUESTION_ACCURACY" in s) | (" ACCURACY" in s)
574
+ ]
575
+ if len(question_correct_lines) > 0:
576
+ question_correct = question_correct_lines[0].split(" ")[-1]
577
+ trial_is_lines = [s for s in trial_var_block_lines if "trial" in s]
578
+ if len(trial_is_lines) > 0:
579
+ trial_is_line = trial_is_lines[0].split(" ")[-1]
580
+ if "pract" in trial_is_line or "end" in trial_is_line:
581
+ trial_is = "practice"
582
+ trial_id = f"{trial_is}_{trial_id}"
583
+ else:
584
+ trial_is = "paragraph"
585
+ trial_id = f"{condition}_{trial_is}_{trial_id}"
586
+ trials_dict["paragraph_trials"].append(trial_idx)
587
+ trials_dict["paragraph_trial_IDs"].append(trial_id)
588
+ else:
589
+ trial_is = "paragraph"
590
+ trial_id = f"{condition}_{trial_is}_{trial_id}_{trial_idx}"
591
+ trials_dict["paragraph_trials"].append(trial_idx)
592
+ trials_dict["paragraph_trial_IDs"].append(trial_id)
593
+ else:
594
+ if len(trial_id) > 1:
595
+ condition = trial_id[1]
596
+ trial_is = "paragraph"
597
+ trials_dict["paragraph_trials"].append(trial_idx)
598
+ trials_dict["paragraph_trial_IDs"].append(trial_id)
599
+ trials_dict[trial_idx] = dict(
600
+ subject=subject,
601
+ filename=filename,
602
+ trial_idx=trial_idx,
603
+ trial_id=trial_id,
604
+ trial_id_idx=idx,
605
+ trial_id_timestamp=trial_id_timestamp,
606
+ trial_is=trial_is,
607
+ trial_var_block_lines=trial_var_block_lines,
608
+ seq=trial_idx,
609
+ item=item,
610
+ depend=depend,
611
+ condition=condition,
612
+ parse_dict=parse_dict,
613
+ )
614
+ if question_answer is not None:
615
+ trials_dict[trial_idx]["question_answer"] = question_answer
616
+ if question_correct is not None:
617
+ trials_dict[trial_idx]["question_correct"] = question_correct
618
+ if trial_index is not None:
619
+ trials_dict[trial_idx]["trial_index"] = trial_index
620
+ last_trial_skipped = False
621
+
622
+ elif "TRIAL_RESULT" in l or "stop_trial" in l:
623
+ trials_dict[trial_idx]["trial_result_idx"] = idx
624
+ trials_dict[trial_idx]["trial_result_timestamp"] = int(parts[0].split("\t")[1])
625
+ if len(parts) > 2:
626
+ trials_dict[trial_idx]["trial_result_number"] = int(parts[2])
627
+ elif "QUESTION_ANSWER" in l and not has_trial_var_lines:
628
+ trials_dict[trial_idx]["question_answer_idx"] = idx
629
+ trials_dict[trial_idx]["question_answer_timestamp"] = int(parts[0].split("\t")[1])
630
+ if len(parts) > 2:
631
+ trials_dict[trial_idx]["question_answer_question_trial"] = int(
632
+ pd.to_numeric(l.strip().split(" ")[-1].strip(), errors="coerce")
633
+ )
634
+ elif "KEYBOARD" in l:
635
+ trials_dict[trial_idx]["keyboard_press_idx"] = idx
636
+ trials_dict[trial_idx]["keyboard_press_timestamp"] = int(parts[0].split("\t")[1])
637
+ elif "DISPLAY COORDS" in l and display_coords is None:
638
+ display_coords = (float(parts[-4]), float(parts[-3]), float(parts[-2]), float(parts[-1]))
639
+ elif "GAZE_COORDS" in l and display_coords is None:
640
+ display_coords = (float(parts[-4]), float(parts[-3]), float(parts[-2]), float(parts[-1]))
641
+ elif "FRAMERATE" in l:
642
+ l_idx = parts.index(metadata_strs[2])
643
+ fps = float(parts[l_idx + 1])
644
+ elif "TRIAL ABORTED" in l or "TRIAL REPEATED" in l:
645
+ if not last_trial_skipped:
646
+ if trial_is == "paragraph":
647
+ trials_dict["paragraph_trials"].remove(trial_idx)
648
+ trial_idx -= 1
649
+ removed_trial_ids.append(trial_id)
650
+ last_trial_skipped = True
651
+ elif "IAREA FILE" in l:
652
+ ias_file = parts[-1]
653
+ ias_file_stem = ias_file.split("/")[-1].split("\\")[-1].split(".")[0]
654
+ trials_dict[trial_idx]["ias_file_from_asc"] = ias_file
655
+ trials_dict[trial_idx]["ias_file"] = ias_file_stem
656
+ if item == "UNKNOWN":
657
+ trials_dict[trial_idx]["item"] = ias_file_stem
658
+ if ias_file_stem in ias_files_dict:
659
+ try:
660
+ ias_file = ias_files_dict[ias_file_stem]
661
+ ias_df = read_ias_file(ias_file, prefix="word") # TODO make option if word or chars in ias
662
+ trials_dict[trial_idx]["words_list"] = ias_df.to_dict("records")
663
+ trials_dict[trial_idx]["chars_list"] = get_chars_list_from_words_list(ias_df, prefix="word")
664
+ except Exception as e:
665
+ ic(f"Reading ias file failed")
666
+ ic(e)
667
+ else:
668
+ ic(f"IAS file {ias_file_stem} not found")
669
+ elif "CALIBRATION" in l and "MSG" in l:
670
+ calibration_method = parts[3].strip()
671
+ if trial_idx > -1:
672
+ trials_dict[trial_idx]["calibration_method"] = calibration_method
673
+ elif "VALIDATION" in l and "MSG" in l and "ABORTED" not in l:
674
+ try:
675
+ calibration_time_line_parts = re.split(r"[ :\t]+", l.strip())
676
+ calibration_time.append(float(calibration_time_line_parts[1]))
677
+ calibration_avg_error.append(float(calibration_time_line_parts[9]))
678
+ calibration_max_error.append(float(calibration_time_line_parts[11]))
679
+ calibration_offset.append(float(calibration_time_line_parts[14]))
680
+ x_px.append(float(calibration_time_line_parts[-2].split(",")[0]))
681
+ y_px.append(float(calibration_time_line_parts[-2].split(",")[1]))
682
+ except Exception as e:
683
+ ic(f"parsing VALIDATION failed for line {l}")
684
+ trials_df = pd.DataFrame([trials_dict[i] for i in range(trial_idx) if i in trials_dict])
685
+
686
+ if (
687
+ question_correct is None
688
+ and "trial_result_number" in trials_df.columns
689
+ and "question_answer_question_trial" in trials_df.columns
690
+ ):
691
+ trials_df["question_answer_selection"] = trials_df["trial_result_number"].shift(-1).values
692
+ trials_df["correct_trial_answer_would_be"] = trials_df["question_answer_question_trial"].shift(-1).values
693
+ trials_df["question_correct"] = [
694
+ check_values(a, b)
695
+ for a, b in zip(trials_df["question_answer_selection"], trials_df["correct_trial_answer_would_be"])
696
+ ]
697
+ for pidx, prow in trials_df.loc[trials_df.trial_is == "paragraph", :].iterrows():
698
+ trials_dict[pidx]["question_correct"] = prow["question_correct"]
699
+ if prow["question_correct"] is not None:
700
+ trials_dict[pidx]["question_answer_selection"] = prow["question_answer_selection"]
701
+ trials_dict[pidx]["correct_trial_answer_would_be"] = prow["correct_trial_answer_would_be"]
702
+ else:
703
+ trials_dict[pidx]["question_answer_selection"] = None
704
+ trials_dict[pidx]["correct_trial_answer_would_be"] = None
705
+ if "question_correct" in trials_df.columns:
706
+ paragraph_trials_df = trials_df.loc[trials_df.trial_is == "paragraph", :]
707
+ overall_question_answer_value_counts = (
708
+ paragraph_trials_df["question_correct"].dropna().astype(int).value_counts().to_dict()
709
+ )
710
+ overall_question_answer_value_counts_normed = (
711
+ paragraph_trials_df["question_correct"].dropna().astype(int).value_counts(normalize=True).to_dict()
712
+ )
713
+ else:
714
+ overall_question_answer_value_counts = None
715
+ overall_question_answer_value_counts_normed = None
716
+ if paragraph_trials_only:
717
+ trials_dict_temp = trials_dict.copy()
718
+ for k in trials_dict_temp.keys():
719
+ if k not in ["paragraph_trials"] + trials_dict_temp["paragraph_trials"]:
720
+ trials_dict.pop(k)
721
+ if len(trials_dict_temp["paragraph_trials"]):
722
+ trial_idx = trials_dict_temp["paragraph_trials"][-1]
723
+ else:
724
+ return trials_dict
725
+ trials_dict["display_coords"] = display_coords
726
+ trials_dict["fps"] = fps
727
+ trials_dict["max_trial_idx"] = trial_idx
728
+ trials_dict["overall_question_answer_value_counts"] = overall_question_answer_value_counts
729
+ trials_dict["overall_question_answer_value_counts_normed"] = overall_question_answer_value_counts_normed
730
+ enum = (
731
+ trials_dict["paragraph_trials"]
732
+ if ("paragraph_trials" in trials_dict.keys() and paragraph_trials_only)
733
+ else range(len(trials_dict))
734
+ )
735
+ for trial_idx in enum:
736
+ if trial_idx not in trials_dict.keys():
737
+ continue
738
+ if "chars_list" in trials_dict[trial_idx]:
739
+ chars_list = trials_dict[trial_idx]["chars_list"]
740
+ else:
741
+ chars_list = []
742
+ if "display_coords" not in trials_dict[trial_idx].keys():
743
+ trials_dict[trial_idx]["display_coords"] = trials_dict["display_coords"]
744
+ trials_dict[trial_idx]["overall_question_answer_value_counts"] = trials_dict[
745
+ "overall_question_answer_value_counts"
746
+ ]
747
+ trials_dict[trial_idx]["overall_question_answer_value_counts_normed"] = trials_dict[
748
+ "overall_question_answer_value_counts_normed"
749
+ ]
750
+ trial_start_idx = trials_dict[trial_idx]["trial_id_idx"]
751
+ trial_end_idx = trials_dict[trial_idx]["trial_result_idx"]
752
+ trial_lines = lines[trial_start_idx:trial_end_idx]
753
+ if len(y_px) > 0:
754
+ trials_dict[trial_idx]["y_px"] = y_px
755
+ trials_dict[trial_idx]["x_px"] = x_px
756
+ if "calibration_method" not in trials_dict[trial_idx]:
757
+ trials_dict[trial_idx]["calibration_method"] = calibration_method
758
+ trials_dict[trial_idx]["calibration_offset"] = calibration_offset
759
+ trials_dict[trial_idx]["calibration_max_error"] = calibration_max_error
760
+ trials_dict[trial_idx]["calibration_time"] = calibration_time
761
+ trials_dict[trial_idx]["calibration_avg_error"] = calibration_avg_error
762
+ for idx, l in enumerate(trial_lines):
763
+ parts = l.strip().split(" ")
764
+ if "START" in l and " MSG" not in l:
765
+ trials_dict[trial_idx]["text_end_idx"] = trial_start_idx + idx
766
+ trials_dict[trial_idx]["start_idx"] = trial_start_idx + idx + 7
767
+ trials_dict[trial_idx]["start_time"] = int(parts[0].split("\t")[1])
768
+ elif "END" in l and "ENDBUTTON" not in l and " MSG" not in l:
769
+ trials_dict[trial_idx]["end_idx"] = trial_start_idx + idx - 2
770
+ trials_dict[trial_idx]["end_time"] = int(parts[0].split("\t")[1])
771
+ elif "MSG" not in l:
772
+ continue
773
+ elif "ENDBUTTON" in l:
774
+ trials_dict[trial_idx]["endbutton_idx"] = trial_start_idx + idx
775
+ trials_dict[trial_idx]["endbutton_time"] = int(parts[0].split("\t")[1])
776
+ elif "SYNCTIME" in l:
777
+ trials_dict[trial_idx]["synctime"] = trial_start_idx + idx
778
+ trials_dict[trial_idx]["synctime_time"] = int(parts[0].split("\t")[1])
779
+ elif start_trial_at_keyword in l:
780
+ trials_dict[trial_idx][f"{start_trial_at_keyword}_line_idx"] = trial_start_idx + idx
781
+ trials_dict[trial_idx][f"{start_trial_at_keyword}_time"] = int(parts[0].split("\t")[1])
782
+ elif "GAZE TARGET OFF" in l:
783
+ trials_dict[trial_idx]["gaze_targ_off_time"] = int(parts[0].split("\t")[1])
784
+ elif "GAZE TARGET ON" in l:
785
+ trials_dict[trial_idx]["gaze_targ_on_time"] = int(parts[0].split("\t")[1])
786
+ trials_dict[trial_idx]["gaze_targ_on_time_idx"] = trial_start_idx + idx
787
+ elif "DISPLAY_SENTENCE" in l: # some .asc files seem to use this
788
+ trials_dict[trial_idx]["gaze_targ_on_time"] = int(parts[0].split("\t")[1])
789
+ trials_dict[trial_idx]["gaze_targ_on_time_idx"] = trial_start_idx + idx
790
+ elif "DISPLAY TEXT" in l:
791
+ trials_dict[trial_idx]["text_start_idx"] = trial_start_idx + idx
792
+ elif "REGION CHAR" in l:
793
+ rg_idx = parts.index("CHAR")
794
+ if len(parts[rg_idx:]) > 8:
795
+ char = " "
796
+ idx_correction = 1
797
+ elif len(parts[rg_idx:]) == 3:
798
+ char = " "
799
+ if "REGION CHAR" not in trial_lines[idx + 1]:
800
+ parts = trial_lines[idx + 1].strip().split(" ")
801
+ idx_correction = -rg_idx - 4
802
+ else:
803
+ char = parts[rg_idx + 3]
804
+ idx_correction = 0
805
+ try:
806
+ char_dict = {
807
+ "char": char,
808
+ "char_xmin": float(parts[rg_idx + 4 + idx_correction]),
809
+ "char_ymin": float(parts[rg_idx + 5 + idx_correction]),
810
+ "char_xmax": float(parts[rg_idx + 6 + idx_correction]),
811
+ "char_ymax": float(parts[rg_idx + 7 + idx_correction]),
812
+ }
813
+ char_dict["char_y_center"] = round(
814
+ (char_dict["char_ymax"] - char_dict["char_ymin"]) / 2 + char_dict["char_ymin"], ndigits=2
815
+ )
816
+ char_dict["char_x_center"] = round(
817
+ (char_dict["char_xmax"] - char_dict["char_xmin"]) / 2 + char_dict["char_xmin"], ndigits=2
818
+ )
819
+ chars_list.append(char_dict)
820
+ except Exception as e:
821
+ ic(f"char_dict creation failed for parts {parts}")
822
+ ic(e)
823
+
824
+ if start_trial_at_keyword == "SYNCTIME" and "synctime_time" in trials_dict[trial_idx]:
825
+ trials_dict[trial_idx]["trial_start_time"] = trials_dict[trial_idx]["synctime_time"]
826
+ trials_dict[trial_idx]["trial_start_idx"] = trials_dict[trial_idx]["synctime"]
827
+ elif start_trial_at_keyword == "GAZE TARGET ON" and "gaze_targ_on_time" in trials_dict[trial_idx]:
828
+ trials_dict[trial_idx]["trial_start_time"] = trials_dict[trial_idx]["gaze_targ_on_time"]
829
+ trials_dict[trial_idx]["trial_start_idx"] = trials_dict[trial_idx]["gaze_targ_on_time_idx"]
830
+ elif start_trial_at_keyword == "START":
831
+ trials_dict[trial_idx]["trial_start_time"] = trials_dict[trial_idx]["start_time"]
832
+ trials_dict[trial_idx]["trial_start_idx"] = trials_dict[trial_idx]["start_idx"]
833
+ elif f"{start_trial_at_keyword}_time" in trials_dict[trial_idx]:
834
+ trials_dict[trial_idx]["trial_start_time"] = trials_dict[trial_idx][f"{start_trial_at_keyword}_time"]
835
+ trials_dict[trial_idx]["trial_start_idx"] = trials_dict[trial_idx][f"{start_trial_at_keyword}_line_idx"]
836
+ else:
837
+ trials_dict[trial_idx]["trial_start_time"] = trials_dict[trial_idx]["start_time"]
838
+ trials_dict[trial_idx]["trial_start_idx"] = trials_dict[trial_idx]["start_idx"]
839
+ if end_trial_at_keyword == "ENDBUTTON" and "endbutton_time" in trials_dict[trial_idx]:
840
+ trials_dict[trial_idx]["trial_end_time"] = trials_dict[trial_idx]["endbutton_time"]
841
+ trials_dict[trial_idx]["trial_end_idx"] = trials_dict[trial_idx]["endbutton_idx"]
842
+ elif end_trial_at_keyword == "END" and "end_idx" in trials_dict[trial_idx]:
843
+ trials_dict[trial_idx]["trial_end_time"] = trials_dict[trial_idx]["end_time"]
844
+ trials_dict[trial_idx]["trial_end_idx"] = trials_dict[trial_idx]["end_idx"]
845
+ elif end_trial_at_keyword == "KEYBOARD" and "keyboard_press_idx" in trials_dict[trial_idx]:
846
+ trials_dict[trial_idx]["trial_end_idx"] = trials_dict[trial_idx]["keyboard_press_idx"]
847
+ else:
848
+ trials_dict[trial_idx]["trial_end_idx"] = trials_dict[trial_idx]["trial_result_idx"]
849
+ if trials_dict[trial_idx]["trial_end_idx"] < trials_dict[trial_idx]["trial_start_idx"]:
850
+ raise ValueError(f"trial_start_idx is larger than trial_end_idx for trial_idx {trial_idx}")
851
+ if len(chars_list) > 0:
852
+ line_ycoords = []
853
+ for idx in range(len(chars_list)):
854
+ chars_list[idx]["char_y_center"] = round(
855
+ (chars_list[idx]["char_ymax"] - chars_list[idx]["char_ymin"]) / 2 + chars_list[idx]["char_ymin"],
856
+ ndigits=2,
857
+ )
858
+ if chars_list[idx]["char_y_center"] not in line_ycoords:
859
+ line_ycoords.append(chars_list[idx]["char_y_center"])
860
+ for idx in range(len(chars_list)):
861
+ chars_list[idx]["assigned_line"] = line_ycoords.index(chars_list[idx]["char_y_center"])
862
+
863
+ letter_width_avg = np.mean(
864
+ [x["char_xmax"] - x["char_xmin"] for x in chars_list if x["char_xmax"] > x["char_xmin"]]
865
+ )
866
+ line_heights = [round(abs(x["char_ymax"] - x["char_ymin"]), 3) for x in chars_list]
867
+ line_xcoords_all = [x["char_x_center"] for x in chars_list]
868
+ line_xcoords_no_pad = np.unique(line_xcoords_all)
869
+
870
+ line_ycoords_all = [x["char_y_center"] for x in chars_list]
871
+ line_ycoords_no_pad = np.unique(line_ycoords_all)
872
+
873
+ trials_dict[trial_idx]["x_char_unique"] = list(line_xcoords_no_pad)
874
+ trials_dict[trial_idx]["y_char_unique"] = list(line_ycoords_no_pad)
875
+ x_diff, y_diff = calc_xdiff_ydiff(
876
+ line_xcoords_no_pad, line_ycoords_no_pad, line_heights, allow_multiple_values=False
877
+ )
878
+ trials_dict[trial_idx]["x_diff"] = float(x_diff)
879
+ trials_dict[trial_idx]["y_diff"] = float(y_diff)
880
+ trials_dict[trial_idx]["num_char_lines"] = len(line_ycoords_no_pad)
881
+ trials_dict[trial_idx]["letter_width_avg"] = letter_width_avg
882
+ trials_dict[trial_idx]["line_heights"] = line_heights
883
+ words_list_from_func, chars_list_reconstructed = add_words(chars_list)
884
+ words_list = words_list_from_func
885
+
886
+ if close_gap_between_words: # TODO this may need to change the "in_word" col for the chars_df
887
+ for widx in range(1, len(words_list)):
888
+ if words_list[widx]["assigned_line"] == words_list[widx - 1]["assigned_line"]:
889
+ word_sep_half_width = (words_list[widx]["word_xmin"] - words_list[widx - 1]["word_xmax"]) / 2
890
+ words_list[widx - 1]["word_xmax"] = words_list[widx - 1]["word_xmax"] + word_sep_half_width
891
+ words_list[widx]["word_xmin"] = words_list[widx]["word_xmin"] - word_sep_half_width
892
+ else:
893
+ chars_df = pd.DataFrame(chars_list_reconstructed)
894
+ chars_df.loc[
895
+ chars_df["char"] == " ", ["in_word", "in_word_number", "num_letters_from_start_of_word"]
896
+ ] = pd.NA
897
+ chars_list_reconstructed = chars_df.to_dict("records")
898
+ trials_dict[trial_idx]["words_list"] = words_list
899
+ trials_dict[trial_idx]["chars_list"] = chars_list_reconstructed
900
+ return trials_dict
901
+
902
+
903
+ def get_lines_from_file(uploaded_file, asc_encoding="ISO-8859-15"):
904
+ if isinstance(uploaded_file, str) or isinstance(uploaded_file, pl.Path):
905
+ with open(uploaded_file, "r", encoding=asc_encoding) as f:
906
+ lines = f.readlines()
907
+ else:
908
+ stringio = StringIO(uploaded_file.getvalue().decode(asc_encoding))
909
+ loaded_str = stringio.read()
910
+ lines = loaded_str.split("\n")
911
+ return lines
912
+
913
+
914
+ def file_to_trials_and_lines(
915
+ uploaded_file,
916
+ asc_encoding: str = "ISO-8859-15",
917
+ close_gap_between_words=True,
918
+ paragraph_trials_only=True,
919
+ uploaded_ias_files=[],
920
+ trial_start_keyword="START",
921
+ end_trial_at_keyword="END",
922
+ ):
923
+ lines = get_lines_from_file(uploaded_file, asc_encoding=asc_encoding)
924
+ trials_dict = asc_lines_to_trials_by_trail_id(
925
+ lines,
926
+ paragraph_trials_only,
927
+ uploaded_file,
928
+ close_gap_between_words=close_gap_between_words,
929
+ ias_files=uploaded_ias_files,
930
+ start_trial_at_keyword=trial_start_keyword,
931
+ end_trial_at_keyword=end_trial_at_keyword,
932
+ )
933
+
934
+ if "paragraph_trials" not in trials_dict.keys() and "trial_is" in trials_dict[0].keys():
935
+ paragraph_trials = []
936
+ for k in range(trials_dict["max_trial_idx"]):
937
+ if trials_dict[k]["trial_is"] == "paragraph":
938
+ paragraph_trials.append(k)
939
+ trials_dict["paragraph_trials"] = paragraph_trials
940
+
941
+ enum = (
942
+ trials_dict["paragraph_trials"]
943
+ if paragraph_trials_only and "paragraph_trials" in trials_dict.keys()
944
+ else range(trials_dict["max_trial_idx"])
945
+ )
946
+ for k in enum:
947
+ if "chars_list" in trials_dict[k].keys():
948
+ max_line = trials_dict[k]["chars_list"][-1]["assigned_line"]
949
+ words_on_lines = {x: [] for x in range(max_line + 1)}
950
+ [words_on_lines[x["assigned_line"]].append(x["char"]) for x in trials_dict[k]["chars_list"]]
951
+ line_list = ["".join([s for s in v]) for idx, v in words_on_lines.items()]
952
+ sentences_temp = "".join([x["char"] for x in trials_dict[k]["chars_list"]])
953
+ sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z]\.)(?<![A-Z][a-z]\.)(?<=\.|\?)", sentences_temp)
954
+ text = "\n".join([x for x in line_list])
955
+ trials_dict[k]["sentence_list"] = [s for s in sentences if len(s) > 0]
956
+ trials_dict[k]["line_list"] = line_list
957
+ trials_dict[k]["text"] = text
958
+ trials_dict[k]["max_line"] = max_line
959
+
960
+ return trials_dict, lines
961
+
962
+
963
+ def discard_empty_str_from_list(l):
964
+ return [x for x in l if len(x) > 0]
965
+
966
+
967
+ def make_folders(gradio_temp_folder, gradio_temp_unzipped_folder, PLOTS_FOLDER):
968
+ gradio_temp_folder.mkdir(exist_ok=True)
969
+ gradio_temp_unzipped_folder.mkdir(exist_ok=True)
970
+ PLOTS_FOLDER.mkdir(exist_ok=True)
971
+ return 0
972
+
973
+
974
+ def plotly_plot_with_image(
975
+ dffix,
976
+ trial,
977
+ algo_choice,
978
+ saccade_df=None,
979
+ to_plot_list=["Uncorrected Fixations", "Corrected Fixations", "Word boxes"],
980
+ lines_in_plot="Uncorrected",
981
+ scale_factor=0.5,
982
+ font="DejaVu Sans Mono",
983
+ box_annotations: list = None,
984
+ ):
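+ # Render the stimulus (words/characters and their boxes) to a temporary PNG with matplotlib, then build a plotly figure that uses this image as background and overlays fixations and saccades as interactive traces.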
985
+ mpl_fig, img_width, img_height = matplotlib_plot_df(
986
+ dffix,
987
+ trial,
988
+ algo_choice,
989
+ None,
990
+ desired_dpi=300,
991
+ fix_to_plot=[],
992
+ stim_info_to_plot=to_plot_list,
993
+ font=font,
994
+ box_annotations=box_annotations,
995
+ )
996
+ mpl_fig.savefig(TEMP_FIGURE_STIMULUS_PATH)
997
+ plt.close(mpl_fig)
998
+ if lines_in_plot == "Uncorrected":
999
+ uncorrected_plot_mode = "markers+lines+text"
1000
+ else:
1001
+ uncorrected_plot_mode = "markers+text"
1002
+
1003
+ if lines_in_plot == "Corrected":
1004
+ corrected_plot_mode = "markers+lines+text"
1005
+ else:
1006
+ corrected_plot_mode = "markers+text"
1007
+
1008
+ if lines_in_plot == "Both":
1009
+ uncorrected_plot_mode = "markers+lines+text"
1010
+ corrected_plot_mode = "markers+lines+text"
1011
+
1012
+ fig = go.Figure()
1013
+ fig.add_trace(
1014
+ go.Scatter(
1015
+ x=[0, img_width * scale_factor],
1016
+ y=[img_height * scale_factor, 0],
1017
+ mode="markers",
1018
+ marker_opacity=0,
1019
+ name="scale_helper",
1020
+ )
1021
+ )
1022
+
1023
+ fig.update_xaxes(visible=False, range=[0, img_width * scale_factor])
1024
+
1025
+ fig.update_yaxes(
1026
+ visible=False,
1027
+ range=[img_height * scale_factor, 0],
1028
+ scaleanchor="x",
1029
+ )
1030
+ if (
1031
+ "Words" in to_plot_list
1032
+ or "Word boxes" in to_plot_list
1033
+ or "Character boxes" in to_plot_list
1034
+ or "Characters" in to_plot_list
1035
+ ):
1036
+ imsource = Image.open(str(TEMP_FIGURE_STIMULUS_PATH))
1037
+ fig.add_layout_image(
1038
+ dict(
1039
+ x=0,
1040
+ sizex=img_width * scale_factor,
1041
+ y=0,
1042
+ sizey=img_height * scale_factor,
1043
+ xref="x",
1044
+ yref="y",
1045
+ opacity=1.0,
1046
+ layer="below",
1047
+ sizing="stretch",
1048
+ source=imsource,
1049
+ )
1050
+ )
1051
+
1052
+ duration_scaled = dffix.duration - dffix.duration.min()
1053
+ duration_scaled = ((duration_scaled / duration_scaled.max()) - 0.5) * 3
1054
+ duration = sigmoid(duration_scaled) * 50 * scale_factor
1055
+ if "Uncorrected Fixations" in to_plot_list:
1056
+ fig.add_trace(
1057
+ go.Scatter(
1058
+ x=dffix.x * scale_factor,
1059
+ y=dffix.y * scale_factor,
1060
+ mode=uncorrected_plot_mode,
1061
+ name="Raw fixations",
1062
+ marker=dict(
1063
+ color=COLORS[-1],
1064
+ symbol="arrow",
1065
+ size=duration.values,
1066
+ angleref="previous",
1067
+ ),
1068
+ line=dict(color=COLORS[-1], width=2 * scale_factor),
1069
+ text=np.arange(dffix.shape[0]),
1070
+ textposition="top right",
1071
+ textfont=dict(
1072
+ family="sans serif",
1073
+ size=23 * scale_factor,
1074
+ color=COLORS[-1],
1075
+ ),
1076
+ hovertext=[f"x:{x}, y:{y}, n:{num}" for x, y, num in zip(dffix.x, dffix[f"y"], range(dffix.shape[0]))],
1077
+ opacity=0.9,
1078
+ )
1079
+ )
1080
+
1081
+ if "Corrected Fixations" in to_plot_list:
1082
+ if isinstance(algo_choice, list):
1083
+ algo_choices = algo_choice
1084
+ repeats = range(len(algo_choice))
1085
+ else:
1086
+ algo_choices = [algo_choice]
1087
+ repeats = range(1)
1088
+ for algoIdx in repeats:
1089
+ algo_choice = algo_choices[algoIdx]
1090
+ if f"y_{algo_choice}" in dffix.columns:
1091
+ fig.add_trace(
1092
+ go.Scatter(
1093
+ x=dffix.x * scale_factor,
1094
+ y=dffix.loc[:, f"y_{algo_choice}"] * scale_factor,
1095
+ mode=corrected_plot_mode,
1096
+ name=algo_choice,
1097
+ marker=dict(
1098
+ color=COLORS[algoIdx],
1099
+ symbol="arrow",
1100
+ size=duration.values,
1101
+ angleref="previous",
1102
+ ),
1103
+ line=dict(color=COLORS[algoIdx], width=1.5 * scale_factor),
1104
+ text=np.arange(dffix.shape[0]),
1105
+ textposition="top center",
1106
+ textfont=dict(
1107
+ family="sans serif",
1108
+ size=22 * scale_factor,
1109
+ color=COLORS[algoIdx],
1110
+ ),
1111
+ hovertext=[
1112
+ f"x:{x}, y:{y}, n:{num}"
1113
+ for x, y, num in zip(dffix.x, dffix[f"y_{algo_choice}"], range(dffix.shape[0]))
1114
+ ],
1115
+ opacity=0.9,
1116
+ )
1117
+ )
1118
+ if "Saccades" in to_plot_list:
1119
+
1120
+ duration_scaled = saccade_df.duration - saccade_df.duration.min()
1121
+ duration_scaled = ((duration_scaled / duration_scaled.max()) - 0.5) * 3
1122
+ duration = sigmoid(duration_scaled) * 65 * scale_factor
1123
+ starting_coordinates = [tuple(row * scale_factor) for row in saccade_df.loc[:, ["xs", "ys"]].values]
1124
+ ending_coordinates = [tuple(row * scale_factor) for row in saccade_df.loc[:, ["xe", "ye"]].values]
1125
+ for sidx, (start, end) in enumerate(zip(starting_coordinates, ending_coordinates)):
1126
+ if sidx == 0:
1127
+ show_legend = True
1128
+ else:
1129
+ show_legend = False
1130
+
1131
+ fig.add_trace(
1132
+ go.Scatter(
1133
+ x=[start[0], end[0]],
1134
+ y=[start[1], end[1]],
1135
+ mode="markers+lines+text",
1136
+ line=dict(color=COLORS[-1], width=1.5 * scale_factor, dash="dash"),
1137
+ showlegend=show_legend,
1138
+ legendgroup="1",
1139
+ name="Saccades",
1140
+ text=sidx,
1141
+ textposition="top center",
1142
+ textfont=dict(family="sans serif", size=22 * scale_factor, color=COLORS[-1]),
1143
+ marker=dict(
1144
+ color=COLORS[-1],
1145
+ symbol="arrow",
1146
+ size=duration.values,
1147
+ angleref="previous",
1148
+ ),
1149
+ )
1150
+ )
1151
+ if "Saccades snapped to line" in to_plot_list:
1152
+
1153
+ duration_scaled = saccade_df.duration - saccade_df.duration.min()
1154
+ duration_scaled = ((duration_scaled / duration_scaled.max()) - 0.5) * 3
1155
+ duration = sigmoid(duration_scaled) * 65 * scale_factor
1156
+
1157
+ if isinstance(algo_choice, list):
1158
+ algo_choices = algo_choice
1159
+ repeats = range(len(algo_choice))
1160
+ else:
1161
+ algo_choices = [algo_choice]
1162
+ repeats = range(1)
1163
+ for algoIdx in repeats:
1164
+ algo_choice = algo_choices[algoIdx]
1165
+ if f"ys_{algo_choice}" in saccade_df.columns:
1166
+ starting_coordinates = [
1167
+ tuple(row * scale_factor) for row in saccade_df.loc[:, ["xs", f"ys_{algo_choice}"]].values
1168
+ ]
1169
+ ending_coordinates = [
1170
+ tuple(row * scale_factor) for row in saccade_df.loc[:, ["xe", f"ye_{algo_choice}"]].values
1171
+ ]
1172
+ for sidx, (start, end) in enumerate(zip(starting_coordinates, ending_coordinates)):
1173
+ if sidx == 0:
1174
+ show_legend = True
1175
+ else:
1176
+ show_legend = False
1177
+ fig.add_trace(
1178
+ go.Scatter(
1179
+ x=[start[0], end[0]],
1180
+ y=[start[1], end[1]],
1181
+ mode="markers+lines",
1182
+ line=dict(color=COLORS[algoIdx], width=1.5 * scale_factor, dash="dash"),
1183
+ showlegend=show_legend,
1184
+ legendgroup="2",
1185
+ text=sidx,
1186
+ textposition="top center",
1187
+ textfont=dict(family="sans serif", size=22 * scale_factor, color=COLORS[algoIdx]),
1188
+ name="Saccades snapped to line",
1189
+ marker=dict(
1190
+ color=COLORS[algoIdx],
1191
+ symbol="arrow",
1192
+ size=duration.values,
1193
+ angleref="previous",
1194
+ ),
1195
+ )
1196
+ )
1197
+ fig.update_layout(
1198
+ plot_bgcolor=None,
1199
+ width=img_width * scale_factor,
1200
+ height=img_height * scale_factor,
1201
+ margin={"l": 0, "r": 0, "t": 0, "b": 0},
1202
+ legend=dict(orientation="h", yanchor="bottom", y=-0.1, xanchor="right", x=0.8),
1203
+ )
1204
+
1205
+ for trace in fig["data"]:
1206
+ if trace["name"] == "scale_helper":
1207
+ trace["showlegend"] = False
1208
+ return fig
1209
+
1210
+
1211
+ def plot_fix_measure(
1212
+ dffix,
1213
+ plot_choices,
1214
+ x_axis_selection,
1215
+ margin=dict(t=40, l=10, r=10, b=1),
1216
+ label_start="Fixation",
1217
+ ):
1218
+ y_label = f"{label_start} Feature"
1219
+ if x_axis_selection == "Index":
1220
+ num_datapoints = dffix.shape[0]
1221
+ x_label = f"{label_start} Number"
1222
+ x_nums = np.arange(num_datapoints)
1223
+ elif x_axis_selection == "Start Time":
1224
+ x_label = f"{label_start} Start Time"
1225
+ x_nums = dffix["start_time"]
1226
+
1227
+ layout = dict(
1228
+ plot_bgcolor="white",
1229
+ autosize=True,
1230
+ margin=margin,
1231
+ xaxis=dict(
1232
+ title=x_label,
1233
+ linecolor="black",
1234
+ range=[x_nums.min() - 1, x_nums.max() + 1],
1235
+ showgrid=False,
1236
+ mirror="all",
1237
+ showline=True,
1238
+ ),
1239
+ yaxis=dict(
1240
+ title=y_label,
1241
+ side="left",
1242
+ linecolor="black",
1243
+ showgrid=False,
1244
+ mirror="all",
1245
+ showline=True,
1246
+ ),
1247
+ legend=dict(orientation="v", yanchor="middle", y=0.95, xanchor="left", x=1.05),
1248
+ )
1249
+
1250
+ fig = go.Figure(layout=layout)
1251
+ for pidx, plot_choice in enumerate(plot_choices):
1252
+ fig.add_trace(
1253
+ go.Scatter(
1254
+ x=x_nums,
1255
+ y=dffix.loc[:, plot_choice],
1256
+ mode="markers",
1257
+ name=plot_choice,
1258
+ marker_color=COLORS[pidx],
1259
+ marker_size=3,
1260
+ showlegend=True,
1261
+ )
1262
+ )
1263
+ fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor="black")
1264
+
1265
+ return fig
1266
+
1267
+
1268
+ def plot_y_corr(dffix, algo_choice, margin=dict(t=40, l=10, r=10, b=1)):
1269
+ num_datapoints = len(dffix.x)
1270
+
1271
+ layout = dict(
1272
+ plot_bgcolor="white",
1273
+ autosize=True,
1274
+ margin=margin,
1275
+ xaxis=dict(
1276
+ title="Fixation Index",
1277
+ linecolor="black",
1278
+ range=[-1, num_datapoints + 1],
1279
+ showgrid=False,
1280
+ mirror="all",
1281
+ showline=True,
1282
+ ),
1283
+ yaxis=dict(
1284
+ title="y correction",
1285
+ side="left",
1286
+ linecolor="black",
1287
+ showgrid=False,
1288
+ mirror="all",
1289
+ showline=True,
1290
+ ),
1291
+ legend=dict(orientation="v", yanchor="middle", y=0.95, xanchor="left", x=1.05),
1292
+ )
1293
+ if isinstance(dffix, dict):
1294
+ dffix = dffix["value"]
1295
+ algo_string = algo_choice[0] if isinstance(algo_choice, list) else algo_choice
1296
+ if f"y_{algo_string}_correction" not in dffix.columns:
1297
+ ic("No line-assignment column found in dataframe")
1298
+ return go.Figure(layout=layout)
1301
+
1302
+ fig = go.Figure(layout=layout)
1303
+
1304
+ if isinstance(algo_choice, list):
1305
+ algo_choices = algo_choice
1306
+ repeats = range(len(algo_choice))
1307
+ else:
1308
+ algo_choices = [algo_choice]
1309
+ repeats = range(1)
1310
+ for algoIdx in repeats:
1311
+ algo_choice = algo_choices[algoIdx]
1312
+ fig.add_trace(
1313
+ go.Scatter(
1314
+ x=np.arange(num_datapoints),
1315
+ y=dffix.loc[:, f"y_{algo_choice}_correction"],
1316
+ mode="markers",
1317
+ name=f"{algo_choice} y correction",
1318
+ marker_color=COLORS[algoIdx],
1319
+ marker_size=3,
1320
+ showlegend=True,
1321
+ )
1322
+ )
1323
+ fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor="black")
1324
+
1325
+ return fig
1326
+
1327
+
1328
+ def download_example_ascs(EXAMPLES_FOLDER, EXAMPLES_ASC_ZIP_FILENAME, OSF_DOWNLAOD_LINK, EXAMPLES_FOLDER_PATH):
1329
+ if not os.path.isdir(EXAMPLES_FOLDER):
1330
+ os.mkdir(EXAMPLES_FOLDER)
1331
+
1332
+ if not os.path.exists(EXAMPLES_ASC_ZIP_FILENAME):
1333
+ download_url(OSF_DOWNLAOD_LINK, EXAMPLES_ASC_ZIP_FILENAME)
1334
+
1335
+ if os.path.exists(EXAMPLES_ASC_ZIP_FILENAME):
1336
+ if EXAMPLES_FOLDER_PATH.exists():
1337
+ EXAMPLE_ASC_FILES = [x for x in EXAMPLES_FOLDER_PATH.glob("*.asc")]
1338
+ if len(EXAMPLE_ASC_FILES) != 4:
1339
+ try:
1340
+ with zipfile.ZipFile(EXAMPLES_ASC_ZIP_FILENAME, "r") as zip_ref:
1341
+ zip_ref.extractall(EXAMPLES_FOLDER)
1342
+ except Exception as e:
1343
+ ic(e)
1344
+ ic(f"Extracting {EXAMPLES_ASC_ZIP_FILENAME} failed")
1345
+
1346
+ EXAMPLE_ASC_FILES = [x for x in EXAMPLES_FOLDER_PATH.glob("*.asc")]
1347
+ else:
1348
+ EXAMPLE_ASC_FILES = []
1349
+ return EXAMPLE_ASC_FILES
word_measures.md ADDED
@@ -0,0 +1,58 @@
1
+ #### Column names for Word measures
2
+ Some features were adapted from the popEye R package ([github](https://github.com/sascha2schroeder/popEye)).
3
+ If a column depends on a line assignment, then _ALGORITHM_NAME is appended to the end of the column name.
4
+
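+ As an illustration (not part of the app's code), the algorithm-specific columns can be pulled out of an exported word-measures table with pandas. The file name `word_measures.csv` and the algorithm name `DIST` below are placeholders for your own export and chosen algorithm:
+
+ ```python
+ import pandas as pd
+
+ # Load the exported word measures (placeholder file name)
+ word_measures = pd.read_csv("word_measures.csv")
+
+ # Columns produced for the line-assignment algorithm called "DIST" (placeholder name)
+ dist_cols = [c for c in word_measures.columns if c.endswith("_DIST")]
+
+ # Identifiers plus that algorithm's measures in one frame
+ dist_measures = word_measures[["subject", "trial_id", "word_number", "word"] + dist_cols]
+ print(dist_measures.head())
+ ```
+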
5
+ - subject: Subject name or ID
6
+ - trial_id: Trial ID
7
+ - item: Item ID
8
+ - condition: Condition (if applicable)
9
+ - word_number: Number of word in trial
10
+ - word_length: Number of characters in word
11
+ - word_xmin: x-coordinate of left side of bounding box
12
+ - word_xmax: x-coordinate of right side of bounding box
13
+ - word_ymin: y-coordinate of top of bounding box
14
+ - word_ymax: y-coordinate of bottom of bounding box
15
+ - word_x_center: x-coordinate of center of bounding box
16
+ - word_y_center: y-coordinate of center of bounding box
17
+ - assigned_line: Line number to which the word belongs
18
+ - word: Text of word
19
+ - blink_ALGORITHM_NAME: Variable indicating whether there was a blink directly before, during, or directly after the word was fixated
20
+ - number_of_fixations_ALGORITHM_NAME: Number of fixations on the word during the whole trial
21
+ - initial_fixation_duration_ALGORITHM_NAME: Duration of the initial fixation on that word
22
+ - first_of_many_duration_ALGORITHM_NAME: Duration of the initial fixation on that word, but only if there was more than one fixation on the word
23
+ - total_fixation_duration_ALGORITHM_NAME: Total time the word was read during the trial in ms (total reading time)
24
+ - gaze_duration_ALGORITHM_NAME: The sum duration of all fixations inside a word until the word is exited for the first time
25
+ - go_past_duration_ALGORITHM_NAME: Go-past time is the sum duration of all fixations from when the interest area is first entered until when it is first exited to the right, including any regressions to the left that occur during that time period (see the toy example at the end of this document)
26
+ - second_pass_duration_ALGORITHM_NAME: Second pass duration is the sum duration of all fixations inside an interest area during the second pass over that interest area.
27
+ - initial_landing_position_ALGORITHM_NAME: Landing position of the first fixation on the word (position within the word, counted from its first character)
28
+ - initial_landing_distance_ALGORITHM_NAME: Landing distance of the first fixation on the word, measured from the start of the word
29
+ - landing_distances_ALGORITHM_NAME: Landing distances of all fixations that landed on the word
30
+ - number_of_regressions_in_ALGORITHM_NAME: Number of regressions into the word
31
+ - singlefix_sac_in_ALGORITHM_NAME: Incoming saccade length (in letters) for the first fixation on the word when it was fixated only once during first-pass reading
32
+ - firstrun_nfix_ALGORITHM_NAME: Number of fixations made on the word during first-pass reading
33
+ - singlefix_land_ALGORITHM_NAME: Landing position (letter) of the first fixation on the word when it was fixated only once during first-pass reading
34
+ - firstrun_skip_ALGORITHM_NAME: Variable indicating whether the word was skipped during first-pass reading
35
+ - firstfix_cland_ALGORITHM_NAME: Centered landing position of the first fixation on the word (Vitu et al., 2001: landing position - ((wordlength + 1) / 2))
36
+ - singlefix_dur_ALGORITHM_NAME: Duration of the first fixation on the word when it was fixated only once during first-pass reading
37
+ - firstrun_gopast_sel_ALGORITHM_NAME: Sum of all fixation durations on the word from the time it was entered until it was left to the right (selective go-past time: go-past time minus the time of the regression path)
38
+ - firstfix_land_ALGORITHM_NAME: Landing position (letter) of the first fixation on the word
39
+ - skip_ALGORITHM_NAME: Variable indicating whether the word was fixated in the trial
40
+ - firstrun_refix_ALGORITHM_NAME: Variable indicating whether the word was refixated during first-pass reading
41
+ - firstrun_reg_out_ALGORITHM_NAME: Variable indicating whether there was a regression from the word during first-pass reading
43
+ - firstfix_sac_out_ALGORITHM_NAME: Outgoing saccade length (in letters) for the first fixation on the word
44
+ - reread_ALGORITHM_NAME: Variable indicating whether the word was reread at least once during the trial
45
+ - refix_ALGORITHM_NAME: Variable indicating whether the word has been refixated at least once during a trial
46
+ - reg_in_ALGORITHM_NAME: Variable indicating whether there was at least one regression into the word
47
+ - firstrun_dur_ALGORITHM_NAME: Time the word was read during first-pass reading (gaze duration)
48
+ - firstfix_sac_in_ALGORITHM_NAME: Incoming saccade length (in letters) for the first fixation on the word
49
+ - singlefix_ALGORITHM_NAME: Variable indicating whether the word was fixated only once during first-pass reading
50
+ - firstrun_gopast_ALGORITHM_NAME: Sum of all fixation durations from the time the word was entered until it was left to the right (go-past time/regression path duration)
51
+ - nrun_ALGORITHM_NAME: Number of times the word was reread within the trial ("reread" means that it was read again after it has been left to the left or right)
52
+ - singlefix_cland_ALGORITHM_NAME: Centred landing position of the first fixation on the word when it was fixated only once during first-pass reading
53
+ - reg_out_ALGORITHM_NAME: Variable indicating whether there was at least one regression from the word
54
+ - firstfix_dur_ALGORITHM_NAME: Duration of the first fixation on the word (first fixation duration)
55
+ - firstfix_launch_ALGORITHM_NAME: Launch site distance (incoming saccade length until the space before the word)
56
+ - singlefix_sac_out_ALGORITHM_NAME: Outgoing saccade length (in letters) for the first fixation on the word when it was fixated only once during first-pass reading
57
+ - firstrun_reg_in_ALGORITHM_NAME: Variable indicating whether there was a regression into the word during first-pass reading
58
+ - singlefix_launch_ALGORITHM_NAME: Launch site distance (incoming saccade length until the space before the word) for the first fixation on the word when it was fixated only once during first-pass reading
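+
+ ##### Worked toy example: gaze duration, go-past duration, total reading time
+
+ The following is a self-contained toy sketch (independent of the app's code) that illustrates how these three duration measures differ for one target word, given a made-up fixation sequence of (word index, duration in ms) pairs:
+
+ ```python
+ # Toy fixation sequence: (index of the word fixated, fixation duration in ms)
+ fixations = [(0, 200), (1, 180), (2, 220), (1, 150), (2, 190), (3, 210), (2, 170)]
+ target = 2  # the word we compute measures for
+
+ # Total reading time: all fixations on the word, anywhere in the trial
+ total_fixation_duration = sum(d for w, d in fixations if w == target)
+
+ # Index of the first fixation that enters the word
+ first_entry = next(i for i, (w, _) in enumerate(fixations) if w == target)
+
+ # Gaze duration: fixations on the word until it is left for the first time
+ gaze_duration = 0
+ for w, d in fixations[first_entry:]:
+     if w != target:
+         break
+     gaze_duration += d
+
+ # Go-past duration: everything from first entry until a later word is fixated,
+ # including regressions back to earlier words
+ go_past_duration = 0
+ for w, d in fixations[first_entry:]:
+     if w > target:
+         break
+     go_past_duration += d
+
+ print(total_fixation_duration, gaze_duration, go_past_duration)  # 580 220 560
+ ```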