Windows-Based Crawler

Nov 15, 2022

I like Excel for Windows. The Mac version is a joke compared to what the full-blown Windows version can do with data analysis and data finagling right from the app itself. Lately, a lot of my work has gone into getting data crawled with Playwright into Excel workbooks stored on OneDrive. Some of that data is small enough that building a full database isn't necessary, yet not normalized enough to just use Power Query.

To achieve this I use GitHub Actions to trigger the run. The workflow fires on a schedule and sends the job to a self-hosted GitHub Actions runner, which starts up a Python script. Since the runner has access to the root volume on the Mac Mini (don't worry, the machine is dedicated to just GitHub Actions), I can use xlwings to launch Excel and update the workbook. Once completed, it just copies the file into OneDrive or Dropbox for me to access elsewhere.

For this example, the workflow is no different from one written for a GitHub-hosted runner, other than that it runs on a self-hosted instance that happens to have Excel on it:

    name: Download and Upload
    on:
      schedule:
        - cron: "0 1 * * *"
      push:
        branches:
          - main

    jobs:
      build:
        runs-on: self-hosted
        steps:
          - uses: actions/[email protected]
          - name: Install Dependencies
            run: |
              pyenv global 3.11
              pip3 install -r ./requirements.txt
          - name: Combine
            run: |
              python ./main.py
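
The `main.py` that the last step runs isn't shown here. As a rough sketch of the Excel update and copy steps described above, it might look something like the following. The function names, the sheet name, and the paths are all hypothetical, and it assumes the workbook and sheet already exist and that the OneDrive or Dropbox folder is synced to a local directory on the runner:

```python
import shutil
from pathlib import Path


def update_workbook(workbook_path, sheet_name, rows):
    """Write `rows` (a list of lists) into the sheet, starting at cell A1."""
    # Imported lazily so the rest of the module works without Excel installed.
    import xlwings as xw

    app = xw.App(visible=False)  # start a background Excel instance
    try:
        book = app.books.open(str(workbook_path))
        sheet = book.sheets[sheet_name]  # assumes this sheet already exists
        sheet.range("A1").value = rows  # a nested list fills a 2-D block
        book.save()
        book.close()
    finally:
        app.quit()  # always shut Excel down so scheduled runs don't pile up


def publish(workbook_path, sync_dir):
    """Copy the finished workbook into a locally synced cloud folder."""
    target_dir = Path(sync_dir).expanduser()
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the copied file inside target_dir.
    return Path(shutil.copy2(workbook_path, target_dir))
```

With crawled rows in hand, the script would call something like `update_workbook("report.xlsx", "Data", rows)` and then `publish("report.xlsx", "~/OneDrive/reports")`. Keeping the copy separate from the Excel step means the sync target can be swapped between OneDrive and Dropbox without touching the xlwings code.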