Tips for programmers during DMI data sprints.

  • You'll be working closely with one or two interdisciplinary teams of humanities or social science scholars as well as designers. The team will rely on you to retrieve, clean, and transform (online) data. You will probably also help with analysing and visualizing the data.
  • As you have a key role in the team, make sure to clearly discuss expectations, desired output, possibilities, limits and time estimates with your team.
  • Before you decide you need more, investigate what data and tools are already available (see e.g. https://tools.digitalmethods.net and https://github.com/digitalmethodsinitiative). Ask Erik or Emile whether a particular tool or functionality already exists, as not everything DMI has made has a public frontend or has been released on GitHub. We also know about a lot of useful external libraries and tools.
  • Coordinate with the other programmers on who can best implement which functionality in which tool.
  • Make a realistic estimate of the time it will take to process or crawl new data. You will not be able to do a full interpretation or visualization before all the data is available, so decide on what type of capture is necessary early on in the project. Consider testing things out on a sample or small part of the data set.
  • Your team is expected to deliver a final project by the end of the week! If a certain job is too difficult to program (and the results are only tentatively relevant), don't do it. Discuss with your team and/or the other programmers to see what is possible in the limited amount of time you have.
  • Present example output of your program and data early to your team members to determine if the data is usable and if the format is correct.

  • When producing output files
    • Be descriptive in the filenames of your results. Try to include all settings, the git hash of your commit, any filters or transformations applied, and the date at which the file was generated (e.g. yyyymmddHHMMSS). This way one always knows how a particular result file came into being.
    • Both the designers from Density Design and the scholars who you will be working with often work with data in tabular format, so that it can be analysed in spreadsheets, OpenRefine or R, and visualized via RAW. Network data is preferably stored in GDF or GEXF so that it can be used in the network analysis software Gephi.
    • Have a policy for files containing sensitive data (consider privacy, security, ...).
    • Encourage project participants to use the same software to process and analyse data. Mixing MS Excel with LibreOffice has no real benefit and may, in the worst case, even cause data corruption. (Also, MS Excel often has problems correctly displaying UTF-8 encoded text, even if you set a BOM.)
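The filename advice above can be sketched as follows. This is a minimal illustration, not an existing DMI tool: the function names `current_git_hash` and `build_output_filename`, the `key-value` layout, and the underscore separator are all assumptions chosen for the example.

```python
import re
import subprocess
from datetime import datetime


def current_git_hash(default="nogit"):
    """Return the short hash of the current commit, or `default` outside a git repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or default
    except (OSError, subprocess.CalledProcessError):
        return default


def build_output_filename(name, settings, git_hash, when, ext="csv"):
    """Join name, sorted settings, git hash and timestamp into one filename.

    The layout (hypothetical) is: name_key-value_..._hash_yyyymmddHHMMSS.ext
    """
    parts = [name]
    for key in sorted(settings):
        parts.append(f"{key}-{settings[key]}")
    parts.append(git_hash)
    parts.append(when.strftime("%Y%m%d%H%M%S"))
    # Replace anything that is risky in filenames with an underscore.
    safe = [re.sub(r"[^A-Za-z0-9.\-]+", "_", str(p)) for p in parts]
    return "_".join(safe) + "." + ext


if __name__ == "__main__":
    print(build_output_filename(
        "hashtags", {"lang": "en", "min": 5}, current_git_hash(), datetime.now()
    ))
```

Sorting the settings keeps filenames stable between runs, so two files produced with the same configuration differ only in hash and timestamp.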

  • When scraping online data
    • Cache the full HTML/JSON/XML of the retrieved files. Apart from letting you use these files for evidentiary purposes, caching ensures that if you forgot to parse a specific field, you do not have to retrieve the data again (slow) and can use the cached files instead (fast).
    • Do not DoS a specific host: respect rate limits and robots.txt, and set a sensible interval between requests to the same host.
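One possible way to combine the two scraping tips above is a small fetcher that stores every raw response on disk and waits between requests to the same host. This is only a sketch under stated assumptions: the class name `CachedFetcher`, the SHA-256 based cache filenames, and the one-second default interval are illustrative choices, and the robots.txt check is left out (Python's `urllib.robotparser` module could be used for that).

```python
import hashlib
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def http_fetch(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


class CachedFetcher:
    """Fetch URLs once, keep the raw bytes on disk, and pause between hits per host."""

    def __init__(self, cache_dir, fetch=http_fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch
        self.min_interval = min_interval  # seconds between requests to one host
        self.clock = clock
        self.sleep = sleep
        self._last_request = {}  # host -> time of the last real request

    def _cache_path(self, url):
        # Hash the URL so that any URL maps to a safe filename.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / (digest + ".raw")

    def get(self, url):
        path = self._cache_path(url)
        if path.exists():
            return path.read_bytes()  # cached: no network traffic at all
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            wait = self.min_interval - (self.clock() - last)
            if wait > 0:
                self.sleep(wait)
        body = self.fetch(url)
        self._last_request[host] = self.clock()
        path.write_bytes(body)  # store the unparsed response for later re-parsing
        return body
```

Because the unparsed bytes are kept, a parser that missed a field can be re-run over the cache directory without contacting the host again.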

  • Check for data integrity, check for data integrity, check for data integrity.
    • Spot check multiple random segments of your program output (data) to see if it is sane and correct.
    • Make sure all numbers are correct. Calculate them one way, calculate them another way, compare, and verify.
    • Mind character sets, especially when designing new tools: make sure your program handles UTF-8 and that your output file has a BOM set.
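The integrity and encoding checks above could look roughly like the following. It is a sketch, not a prescribed procedure: the helper names are hypothetical, the file is written with Python's `utf-8-sig` codec (which prepends a BOM), and the verification re-reads the file to compare row counts, a column total, and a random sample of rows against the data in memory.

```python
import csv
import random


def write_csv_with_bom(path, header, rows):
    """Write rows as UTF-8 CSV; the utf-8-sig codec puts a BOM at the start."""
    with open(path, "w", encoding="utf-8-sig", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    """Read the file back; utf-8-sig strips the BOM again if it is present."""
    with open(path, "r", encoding="utf-8-sig", newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        return header, list(reader)


def verify_output(path, rows, count_column, sample_size=5, seed=0):
    """Compare the file on disk with the in-memory rows in three ways."""
    _, on_disk = read_csv(path)
    # 1. Same number of rows.
    if len(on_disk) != len(rows):
        return False
    # 2. Same total, calculated once from memory and once from disk.
    total_memory = sum(int(r[count_column]) for r in rows)
    total_disk = sum(int(r[count_column]) for r in on_disk)
    if total_memory != total_disk:
        return False
    # 3. Spot check a random sample of rows, cell by cell.
    rng = random.Random(seed)
    for i in rng.sample(range(len(rows)), min(sample_size, len(rows))):
        if [str(v) for v in rows[i]] != on_disk[i]:
            return False
    return True
```

A round trip like this only catches problems introduced while writing; checking the numbers against an independent calculation (for example a spreadsheet total) is still worthwhile.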

  • Source code is the ultimate documentation: provide your code in our GitHub repository and add relevant inline explanations. Release early, release often.

  • Don't panic.

Topic revision: r3 - 23 Jun 2015, ErikBorra