Each week we seek to provide a software tip of the week geared towards helping you achieve your software goals. Views expressed in the content belong to the content creators and not the organization, its affiliates, or employees. If you have any software questions or suggestions for an upcoming tip of the week, please don’t hesitate to reach out to #software-engineering on Slack or email DBMISoftwareEngineering at olucdenver.onmicrosoft.com
Have you ever found yourself spending hours formatting your code so it looks just right? Have you ever caught a duplicative import statement in your code? We recommend using open source linting tools to help avoid common issues like these and save time.
-
-
-
-
Software Linting is the practice of detecting and sometimes automatically fixing stylistic, syntactical, or other programmatic issues. Linting usually involves installing standardized or opinionated libraries which allow you to quickly make code corrections. Using linting tools also can help you learn nuanced or unwritten intricacies of programming languages while you solve problems in your work.
-
-
TLDR (too long, didn’t read); Linting is a type of static analysis which can be used to instantly address many common code issues. isort provides automatic Python import statement linting. pre-commit provides an easy way to test and apply isort (in addition to other linting tools) through source control workflows.
-
-
Example: Python Code Style Linting with isort
-
-
Isort is a Python utility for linting package import statements (sorting, deduplication, etc). Isort may be used to automatically fix your import statements or test for their consistency. See the isort installation documentation for more information on getting started.
-
-
Before isort
-
-
The following Python code shows a series of import statements. There are duplicate imports and the imports are a mixture of custom (possibly local), external, and built-in packages. Isort can check this code using the command: isort <file or path> --check.
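As an illustration (the specific packages below are hypothetical stand-ins rather than content from the original example), the imports before isort might look something like this:

```python
# example.py (hypothetical) - imports before isort:
# duplicated, unsorted, and mixing built-in, external, and local packages
import pandas as pd
import os
import sys
import custom_module
import os
import numpy as np
from custom_module import helper_function
```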
Isort can fix the code automatically using the command: isort <file or path>. After applying the fixes, notice that all packages are alphabetized and grouped by built-in, external, and custom packages.
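Continuing the hypothetical example above, and assuming custom_module is treated as a local (first-party) package, the imports after running isort might look roughly like this:

```python
# example.py (hypothetical) - imports after `isort example.py`:
# deduplicated, alphabetized, and grouped as built-in, external, then local
import os
import sys

import numpy as np
import pandas as pd

import custom_module
from custom_module import helper_function
```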
Pre-commit is a framework which can be used to apply linting checks and fixes as git-hooks or the command line. Pre-commit includes existing hooks for many libraries, including isort. See the pre-commit installation documentation to get started.
-
-
Example .pre-commit-config.yaml Configuration
-
-
The following YAML content can be used to reference isort from pre-commit. This configuration can be extended to include many different pre-commit hooks.
-
-
# example .pre-commit-config.yaml file leveraging isort
# See https://pre-commit.com/hooks.html for more hooks
---
repos:
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
-
-
-
Example Using pre-commit Manually
-
-
Imagine we have a file, example.py, which includes the content from Before isort. Running pre-commit manually on the directory files will first automatically apply isort formatting. The second time pre-commit is run there will be no issue (pre-commit resolved it automatically).
-
-
First detecting and fixing the file:
-
-
% pre-commit run --all-files
isort...................................Failed
- hook id: isort
- files were modified by this hook

Fixing example.py
-
-
-
Then checking that the file was fixed:
-
-
% pre-commit run --all-files
isort...................................Passed
-
Diagrams can be a useful way to illuminate and communicate ideas. Free-form drawing or drag and drop tools are one common way to create diagrams. With this tip of the week we introduce another option: diagrams as code (DaC), or creating diagrams by using code.
-
-
-
-
TLDR (too long, didn’t read);
Diagrams as code (DaC) tools provide an advantage for illustrating concepts by enabling quick visual positioning, source-controllable input, portability (both for input and output formats), and open collaboration through reproducibility. Consider using Mermaid (as well as many other DaC tools) to assist your diagramming efforts; it can be used directly, within your markdown files, or within Github commentary using code blocks (for example, ` ```mermaid `).
-
-
Example Mermaid Diagram as Code
-
-
-
-
-
-
flowchart LR
  a --> b
  b --> c
  c --> d1
  c --> d2
-
-
-
Mermaid code
-
-
-
-
-
-
-
Mermaid rendered
-
-
-
-
-
-
-
The above shows an example mermaid flowchart code and its rendered output. The syntax is specific to mermaid and acts as a simple coding language to help you depict ideas. Mermaid also includes options for sequence, class, Gantt, and other diagram types. Mermaid provides a live editor which can be used to quickly draft and share content.
-
-
Mermaid Github Integration
-
-
Mermaid diagrams may be rendered directly from markdown (.md) and text communication content (like pull request or issue comments) within Github. See Github’s blog post on mermaid for more details covering this topic.
-
-
Mermaid Jupyter Notebook Integration
-
-
-
-
Mermaid diagrams can be rendered directly within Jupyter notebooks with a small amount of additional code and a rendering service. One way to render mermaid and other diagrams within notebooks is to use Kroki.io. See this example for an interactive demonstration.
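As a minimal sketch of this approach (assuming the kroki.io service and its deflate + base64url encoding scheme; the diagram content below is arbitrary), a notebook cell might look like the following:

```python
# Render a mermaid diagram inside a Jupyter notebook via the kroki.io service.
import base64
import zlib

from IPython.display import Image, display

mermaid_code = """
flowchart LR
  a --> b
  b --> c
"""

# kroki.io expects the diagram source compressed with deflate and then
# encoded as URL-safe base64 within the request URL
encoded = base64.urlsafe_b64encode(
    zlib.compress(mermaid_code.encode("utf-8"), 9)
).decode("ascii")

# display the rendered PNG returned by kroki.io
display(Image(url=f"https://kroki.io/mermaid/png/{encoded}"))
```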
-
-
Version Controlling Your Diagrams
-
-
graph LR
  subgraph Compose
    write[Write Diagram Code]
    render[Render Diagram]
  end
  subgraph Store[Save and Share]
    save[Upload Diagram]
  end
  write --> | create | render
  render --> | revise | write
  render --> | code and exports | save
-
-
Mermaid version control workflow example
-
-
Creating your diagrams with code means you can enable reproducible and collaborative work on version control systems (like git). Using git in this way allows you to reference and remix your diagrams as part of development. It also allows others to collaborate on diagrams together making modifications as needed.
-
-
Additional Resources
-
-
Please see the following additional resources related to diagrams as code.
Tip of the Week: Data Engineering with SQL, Arrow and DuckDB
Apache Arrow is a language-independent and high performance data format useful in many scenarios. DuckDB is an in-process SQL-based data management system which is Arrow-compatible. In addition to providing a SQLite-like database format, DuckDB also provides a standardized and high performance way to work with Arrow data where otherwise one may be forced to use language-specific data structures or transforms.
-
-
-
-
TLDR (too long, didn’t read);
DuckDB may be used to access and transform Arrow-based data from multiple data formats through SQL. Using Arrow and DuckDB provides a cross-language way to access and manage data. Data development with these tools may also enable improvements in performance, understandability, or long term maintainability of your code.
Arrow provides a multi-language data format which prevents you from needing to convert to other formats when dealing with multiple in-memory or serialized data formats. For example, this means that a Python and an R package may use the same in-memory or file-based data without conversion (where normally a Python Pandas dataframe and R data frame may require a conversion step in between).
The same stands for various libraries within one language - Arrow enables interchange between various language library formats (for example, a Python Pandas dataframe and Python dictionary are two distinct in-memory formats which may require conversions). Conversions to or from these formats can involve data type or other inferences which are costly to productivity. You can save time and effort by avoiding conversions using Arrow.
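For instance (a minimal sketch with made-up data), an Arrow table created in Python can be written to the Arrow (Feather/IPC) format and then read directly by R's arrow package, avoiding a bespoke conversion step:

```python
import pyarrow as pa
import pyarrow.feather as feather

# build an Arrow table in memory (hypothetical example data)
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# the resulting file can be opened with arrow::read_feather() in R or
# pyarrow.feather.read_table() in Python, with no format translation
feather.write_feather(table, "example.arrow")
```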
-
-
Using SQL to Join or Transform Arrow Data via DuckDB
DuckDB provides a management client and relational database format (similar to SQLite databases) which may be handled with Arrow. SQL may be used with the DuckDB client to filter, join, or change various data types. Due to Arrow’s cross-language properties, there is no additional cost to using SQL through DuckDB to return data for implementation within other purpose-built data formats. DuckDB provides client APIs in many languages (for example, Python, R, and C++), making it possible to write DuckDB client code with SQL to manage data without having to use manually written sub-procedures.
Using SQL to perform these operations with Arrow provides an opportunity for your data code to be used (or understood) within other languages without additional rewrites. SQL also provides you access to roughly 48 years’ worth of data management improvements without being constrained by imperative language data models or schema (reference: SQL Wikipedia: First appeared: 1974).
-
-
Example with SQL to Join Arrow Data with DuckDB in Python
-
-
-
-
The following example notebook shows how to use SQL to join data from multiple sources using the DuckDB client API within Python. The example includes DuckDB querying a remote CSV, local Parquet file, and Arrow in-memory tables.
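A condensed, hypothetical sketch of that pattern (the table contents and the visits.parquet file below are illustrative, not taken from the example notebook) might look like this in Python:

```python
import duckdb
import pyarrow as pa

# an in-memory Arrow table (hypothetical example data)
patients = pa.table({"id": [1, 2, 3], "name": ["ada", "ben", "cai"]})

# DuckDB can reference the Arrow table by its Python variable name and
# join it against a Parquet file directly from SQL
result = duckdb.query(
    """
    SELECT patients.name, visits.visit_date
    FROM patients
    JOIN 'visits.parquet' AS visits
      ON visits.patient_id = patients.id
    """
).arrow()  # return the query result as an Arrow table

print(result)
```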
Tip of the Week: Remove Unused Code to Avoid Software Decay
The act of creating software often involves many iterations of writing, personal collaborations, and testing. During this process it’s common to lose awareness of code which is no longer used, and thus may not be tested or otherwise linted. Unused code may contribute to “software decay”, the gradual diminishment of code quality or functionality. This post will cover software decay and strategies for addressing unused code to help keep your code quality high.
-
-
-
-
TLDR (too long, didn’t read);
Unused code is easy to amass and may cause your code quality or code functionality to diminish (“decay”) over time. Effort must be taken to maintain any code or artifacts you add to your repositories, including those which are unused. Consider using Vulture, Pylint, or Coverage to help illuminate sections of your code which may need to be removed.
-
-
Code Lifecycle and Maintenance
-
-
stateDiagram
  direction LR
  removal : removed or archived
  changes : changes needed
  [*] --> added
  added --> maintenance
  state maintenance {
    direction LR
    updated --> changes
    changes --> updated
  }
  maintenance --> removal
  removal --> [*]
-
-
-
-
Diagram showing code lifecycle activities.
-
-
Adding code to a project involves a loose agreement to maintain it for however long the code is available. That maintenance can involve added effort when making changes as well as passive impacts like longer test durations or decreased readability (simply from more code).
-
-
When considering multiple parts of code in many files, this maintenance can become untenable, leading to the gradual decay of your code quality or functionality. For example, let’s assume one line of code costs 30 seconds to maintain (feel free to substitute time with monetary or personnel aspects as an example measure here too). 1000 lines of code would cost 500 minutes (or about 8 hours) to maintain. This becomes more complex when considering multiple files, collaborators, or languages.
-
-
-
-
Think about your project as if it were on a hiking trail: “Carry as little as possible, but choose that little with care.” (Earl Shaffer). Be careful what code you choose to carry; it may impact your ability to address needs over time and lead to otherwise unintended software decay.
-
-
Detecting Unused Code with Vulture
-
-
Understanding the cost of added content, it’s important to routinely examine which parts of your code are still necessary. You can prepare your code for a long journey by detecting (and removing) unused code with various automated tools. These tools are generally designed for static analysis and linting, meaning they may also be incorporated into automated and routine testing.
Example of Vulture command line usage to discover unused code.
-
-
Vulture is one tool dedicated to finding unused Python code. Vulture provides both a command line interface and Python API for discovering unused code. It also provides a rough confidence score to show how certain it is that a block of code is unused. See the following interactive example for a demonstration of using Vulture.
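As a small, hypothetical illustration, consider a module with an unused import and an unused function; running Vulture against it (for example, `vulture example_module.py`) would report both, each with a confidence value:

```python
# example_module.py (hypothetical)
import os  # unused import - vulture would flag this


def used_function():
    return "used"


def unused_function():  # defined but never called - vulture would flag this
    return "unused"


print(used_function())
```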
Further Code Usefulness Detection with Pylint and Coverage.py
-
-
In addition to Vulture, Pylint and Coverage.py can be used in a similar way to help show where code may not have been used within your project.
-
-
Pylint focuses on code style and other static analysis in addition to unused variables. See Pylint’s Checkers page for more details here, using “unused-*” as a reference to checks it performs which focus on unused code.
-
-
Coverage.py helps show you which parts of your code have been executed or not. A common use case for Coverage involves measuring “test coverage”, or which parts of your code are executed in relationship to tests written for that code. This provides another perspective on code utility: if there’s not a test for the code, is it worth keeping?
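As a brief, hypothetical illustration, imagine the module below paired with a test that only exercises one of its functions; running Coverage.py over the test suite (for example, `coverage run -m pytest` followed by `coverage report -m`) would show the untested lines, prompting the question of whether they should stay:

```python
# example_module.py (hypothetical)
def add_numbers(a, b):
    # exercised by tests - reported as covered
    return a + b


def never_tested(a, b):
    # no test executes this - reported as uncovered (and possibly unneeded)
    return a - b


# tests/test_example_module.py (hypothetical)
# from example_module import add_numbers
#
# def test_add_numbers():
#     assert add_numbers(2, 3) == 5
```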
Software documentation is sometimes treated as a less important or secondary aspect of software development. Treating documentation as code allows developers to version control the shared understanding and knowledge surrounding a project. Leveraging this paradigm also enables the use of tools and patterns which have been used to strengthen code maintenance. This article covers one such pattern: linting, or static analysis, for documentation treated like code.
-
-
-
-
TLDR (too long, didn’t read);
There are many linting tools available which enable quick revision of your documentation. Try using codespell for spelling corrections, mdformat for markdown file formatting corrections, and vale for more complex editorial style or natural language assessment within your documentation.
-
-
Spelling Checks
-
-
-
-
-
-
<!--- readme.md --->
## Example Readme

Thsi project is a wokr in progess.
Code will be updated by the team very often.

(CU Anschutz)[https://www.cuanschutz.edu/]
-
Example showing codespell detection of misspelled words
-
-
-
-
-
Spelling checks may be used to automatically detect incorrect spellings of words within your documentation (and code!). Codespell is one library which can lint your word spelling. Codespell may be used through the command-line and also through a pre-commit hook.
-
-
Markdown Format Linting
-
-
-
-
-
-
<!--- readme.md --->
## Example Readme

This project is a work in progress.
Code will be updated by the team very often.

(CU Anschutz)[https://www.cuanschutz.edu/]
-
-
-
Example readme.md with markdown issues
-
-
-
-
-
% markdownlint readme.md
readme.md:2 MD041/first-line-heading/first-line-h1
First line in a file should be a top-level heading
[Context: "## Example Readme"]
readme.md:6:5 MD011/no-reversed-links Reversed link
syntax [(link)[https://www.cuanschutz.edu/]]
-
-
-
-
Example showing markdownlint detection of issues
-
-
-
-
-
The format of your documentation files may also be linted for common issues. This may catch things which are otherwise hard to see when editing content. It may also improve the overall web accessibility of your content, for example, through proper HTML header order and image alternate text. Markdownlint is one library which can be used to find issues within markdown files.
-
-
Additional and similar resources to explore in this area:
<!--- readme.md --->
# Example Readme

This project is a work in progress.
Code will be updated by the team very often.

[CU Anschutz](https://www.cuanschutz.edu/)
-
-
-
Example readme.md with questionable editorial style
-
-
-
-
-
% vale readme-example.md
readme-example.md
2:12 error Did you really mean 'Readme'? Vale.Spelling
5:11 warning 'be updated' may be passive write-good.Passive
              voice. Use active voice if you
              can.
5:34 warning 'very' is a weasel word! write-good.Weasel
-
-
-
Example showing vale warnings and errors
-
-
-
-
-
Maintaining consistent editorial style and grammar may also be a focus within your documentation. These issues are sometimes more difficult to detect and more opinionated in nature. In some cases, organizations publish guides on this topic (see Microsoft Writing Style Guide, or Google Developer Documentation Style Guide). Some of the complexity of writing style may be linted through tools like Vale. Using common configurations through Vale can unify how language is used within your documentation by linting for common style and grammar.
-
-
Additional and similar resources to explore in this area:
-
-
-
textlint - similar to Vale with a modular approach
-
-
-
Resources
-
-
Please see the following resources on this topic.
-
-
-
codespell - a code and documentation spell checker.
Programming often involves long periods of problem solving which can sometimes lead to unproductive or exhausting outcomes. This article covers one way to avoid less productive time expense or protect yourself from overexhaustion through a technique called “timeboxing” (also sometimes referenced as “timeblocking”).
-
-
-
-
TLDR (too long, didn’t read);
Use timeboxing techniques such as Pomodoro® or 52/17 to help modularize your software work to ensure you don’t fall victim to Parkinson’s Law. Timeboxing may also map well to Github Issues, which allows your software tasks to be further aligned, documented, and chunked in collaboration with others.
-
-
Controlling Work Time Expansion
-
-
-
-
Have you ever spent more time than you thought you would on a task? An adage which helps explain this phenomenon is Parkinson’s Law:
-
-
-
“… work expands so as to fill the time available for its completion.”
-
-
-
The practice of writing software is not protected from this “law”. It may be affected in sometimes worse ways during long periods of uninterrupted programming, where we may have an inclination to forget productive goals.
-
-
One way to address this is through the use of timeboxing techniques. Timeboxing sets a fixed limit to the amount of time one may spend on a specific activity. One can use timeboxing to systematically address many tasks, for example, as with the Pomodoro® Technique (developed by Francesco Cirillo) or the 52/17 rule. While there are many ways to apply timeboxing, make sure to balance activity with short breaks and focus switches to help ensure we don’t become overwhelmed.
-
-
Timeboxing Means Modularization
-
-
Timeboxing has an auxiliary benefit of framing your work as objective and oftentimes smaller chunks (we have to know what we’re timeboxing in order to use this technique). Creating distinct chunks of work applies for both our daily time schedule as well as code itself. This concept is more broadly called “modularization” and helps to distinguish large portions of work (whether in real life or in code) as smaller and more maintainable chunks.
-
-
-
-
-
-
# Goals
- Finish writing paper
-
-
-
-
-
-
-
Vague and possibly large task
-
-
-
-
-
-
# Goals
- Finish writing paper
  - Create paper outline
  - Finish writing introduction
  - Check for dead hyperlinks
  - Request internal review
-
-
-
Modular and more understandable tasks
-
-
-
-
-
Breaking down large amounts of work as smaller chunks within our code helps to ensure long-term maintainability and understandability. Similarly, keeping our tasks small can help ensure our goals are achievable and understandable (to ourselves or others). Without this modularity, tasks can be impossible to achieve (subjective in nature) or very difficult to understand. Stated differently, taking many small steps can lead to a big change in an organized, oftentimes less exhausting way (related graphic).
List of example version control repository issues with associated time duration.
-
-
The parallels between the time we give a task and related code can work towards your benefit. For example, Github Issues can be created to outline a timeboxed task which relates to a distinct chunk of code to be created, updated, or fixed. Once development tasks have been outlined as issues, a developer can use timeboxing to help organize how much time to allocate on each issue.
-
-
Using Github Issues in this way provides a way to observe task progress associated with one or many repositories. It also increases collaborative opportunities for task sizing and description. For example, if a task looks too large to complete in a reasonable amount of time, developers may work together to break the task down into smaller modules of work.
-
-
Be Kind to Yourself: Take Breaks
-
-
While timeboxing is often a conversation about how to be more productive, it’s also worth remembering: take breaks to be kind to yourself and more effective. Some studies and thought leadership have shown that taking breaks may be necessary to avoid performance decreases and impacts to your health. There’s also some indication that taking breaks may lead to better work. See below for just a few examples:
This article covers using the software technique of linting on R code in order to improve code quality, development velocity, and collaboration.
-
-
-
-
TLDR (too long, didn’t read);
Use software linting (static analysis) practices on your R code with existing packages lintr and styler (among others). These linters may be applied using pre-commit in your local development environment or as continuous tests using, for example, Github Actions.
-
-
Treating R as Software
-
-
-
“Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented.”
The R programming language is sometimes treated as only a statistics system instead of software. This treatment can sometimes lead to common development issues which are also experienced in other languages. Addressing R as software enables developers to enhance their work by taking advantage of existing concepts applied to many other languages.
-
-
Linting with R
-
-
flowchart LR
  write[Write R code] --> |check| check[Check code with linters]
  check --> |revise| write
-
-
-
-
Workflow loop depicting writing R code and revising with linters.
-
-
Software linting, or static analysis, is one way to ensure a minimum level of code quality without writing new tests. Linting checks how your code is structured without running it to make sure it abides by common language paradigms and logical structures. Using linting tools allows a developer to gain quick insights about their code before it is viewed or used by others.
-
-
One way to lint your R code is by using the lintr package. The lintr package is also complementary to the styler package, which formats the syntax of R code in a consistent way. Both of these can be used independently or as part of continuous quality checks for R code repositories.
-
-
Automated Linting Checks with R
-
-
flowchart LR
  subgraph development
    write
    check
  end
  subgraph linters
    direction LR
    lintr
    styler
  end
  check <-.- linters
  write[Write R code] --> |check| check[Check code with pre-commit]
  check --> |revise| write
-
-
-
Workflow showing development with pre-commit using multiple linters.
-
-
lintr and styler can be incorporated into automated checks to help make sure linting (or other steps) are always used with new code. One tool which can help with this is pre-commit, which acts as a local development tool in addition to providing observability within source control (more on this later).
-
-
Using pre-commit locally enables quick feedback loops using one or many checkers (such as lintr, styler, or others). Pre-commit may be used through the use of git hooks or manually using pre-commit run ... from a command-line. See this example of pre-commit checks with R for an example of multiple pre-commit checks for R code.
-
-
Continuous and Observable Testing for R
-
-
flowchart LR
  subgraph development [local development]
    direction LR
    write
    check
    commit
  end
  subgraph remote[Github repository]
    direction LR
    action["Check code (remotely)"]
  end
  write[Write R code] --> |check| check[Check code with pre-commit]
  check --> |revise| write
  check --> commit[commit + push]
  commit --> |optional trigger| action
  check -.-> |perform same checks| action
-
-
-
Workflow showing pre-commit used as continuous testing tool with Github.
-
-
Pre-commit linting checks can also be incorporated into continuous testing performed on your repository. One way to do this is using Github Actions. Github Actions provides a programmatic way to specify automatic steps taken as changes occur to a repository.
-
-
Pre-commit provides an example Github Action which will automatically check and alert repository maintainers when code challenges are detected. Using pre-commit in this way allows R developers to ensure lintr checks are performed on any new work checked into a repository. This can have benefits towards decreasing pull request (PR) review time and standardizing how code collaboration takes place for R developers.
-
-
Resources
-
-
Please see the following resources on this topic.
Git provides a feature called branching which facilitates parallel and segmented programming work through commits with version control. Using branching enables both work concurrency (multiple people working on the same repository at the same time) as well as a chance to isolate and review specific programming tasks. This article covers some conceptual best practices with branching, reviewing, and merging code using Github.
-
-
-
-
Please note: the content below represents one opinion in a larger space of Git workflow concepts (it’s not perfect!). Developer cultures may vary on these topics; be sure to acknowledge people and culture over exclusive or absolute dedication to what is found below.
flowchart LR
  subgraph Course
    direction LR
    open["open\nassignment"]
    turn_in["review\nassignment"]
  end
  subgraph Student [" Student"]
    direction LR
    work["completed\nassignment"]
  end
  open -.-> turn_in
  open --> |works towards| work
  work --> |seeks review| turn_in
-
-
-
-
An example course and student assignment workflow.
-
-
Git branching practices may be understood in context with similar workflows from real life. Consider a student taking a course, where an assignment is given to them to complete. In addition to the steps shown in the diagram above, it’s important to think about why this pattern is beneficial:
-
-
-
Completing an assignment allows us as social, inter-dependent beings to present new findings which enable learning and amalgamation of additional ideas from others.
-
The timebound nature of assignments enables us to practice some form of timeboxing so as to minimize tasks which may take too much time.
-
Segmenting applied learning in distinct, goal-orientated chunks helps make larger topics easier to understand.
An example git diagram showing assignment branch based off main.
-
-
Following the course assignment workflow, the diagram above shows an in-progress assignment branch based off of the main branch. When the assignment branch is created, we bring into it everything we know from main (the course) so far in the form of commits, or groups of changes to various files. Branching allows us to make consistent and well described changes based on what’s already happened without impacting others’ work in the meantime.
-
-
-
Branching best practices:
-
-
-
- Keep the name and work with branches dedicated to a specific and focused purpose. For example: a branch named fix-links-in-docs might entail work related to fixing HTTP links within documentation.
- Consider the use of Github Forks (along with branches within the fork) to help further isolate and enrich work potential. Forks also allow remixing existing work into new possibilities.
- festina lente or “make haste, slowly”: Commits on any branch represent small chunks of a cohesive idea which will eventually be brought to main. It is often beneficial to be consistent with small, gradual commits to avoid a rushed or incomplete submission. The same applies more generally for software; taking time upfront to do things well can mean time saved later.
An example git diagram showing assignment branch being merged with main after a review.
-
-
The diagram above depicts a merge from the assignment branch to pull the changes into the main branch, simulating an assignment being returned for review within a course. While merges may be forced without review, it’s a best practice to create a Pull Request (PR) Review (also known as a Merge Request (MR) on some systems) and then ask other members of your team to review it. Doing this provides a chance to make revisions before code changes are “finalized” within the main branch.
-
-
-
Github provides special tools for reviews which can assist both the author and reviewer:
-
-
-
- Keep code changes intended for review small, enabling reviewers to reason through the work, provide feedback more quickly, and practice incremental continuous improvement (it may be difficult to address everything at once!). This may also keep the git history for a repository clearer.
- Github comments: overall review comments (encompassing all work from the branch) and inline comments (inquiring about individual lines of code) may be provided. Inline comments may also include code suggestions, which allow for code-based revision suggestions that may be committed directly to the branch using markdown “suggestion” code blocks.
- Github issues: creating issues from comments allows the creation of new repository issues to address topics outside of the current PR.
An example git diagram showing the main branch after the assignment branch has been merged (and removed).
-
-
Changes may be made within the assignment branch until the work is in a state where the authors and reviewers are satisfied. At this point, the branch changes may be merged into main. Approvals are sometimes provided informally (for ex., with a comment: “LGTM (looks good to me)!”) or explicitly (for ex., approvals within Github) to indicate or enable branch merge readiness. After the merge, changes may continue to be made in a similar way (perhaps accounting for concurrently branched work elsewhere). Generally, a merged branch may be removed afterwards to help maintain an organized working environment (see Github PR branch removal).
-
-
-
Github provides special tools for merging:
-
-
-
- Decide which merge strategy is appropriate (there are many!): there are many merge strategies within Github (merge commits, squash merges, and rebase merging). Take time to understand them and choose which one works best.
- Consider using branch protection to automate merge requirements: the main or other branches may be “protected” against merges using branch protection rules. These rules can require reviewer approvals or automatic status checks to pass before changes may be merged.
- Use merge queuing to manage multiple PR’s: when there are many unmerged PR’s, it can sometimes be difficult to document and ensure each is merged in a desired sequence. Consider using merge queues to help with this process.
-
-
-
-
Additional Resources
-
-
The links below may provide additional guidance on using these git features, including in-depth coverage of various features and related configuration.
Tip of the Week: Automate Software Workflows with GitHub Actions
There are many routine tasks which can be automated to help save time and increase reproducibility in software development. GitHub Actions provides one way to accomplish these tasks using code-based workflows and related workflow implementations. This type of automation is commonly used to perform tests, builds (preparing for the delivery of the code), or delivery itself (sending the code or related artifacts where they will be used).
flowchart LR
  start((start)) --> action
  action["action(s)"] --> en((end))
  style start fill:#6EE7B7
  style en fill:#FCA5A5
-
-
-
-
An example workflow.
-
-
Workflows consist of sequenced activities used by various systems. Software development workflows help accomplish work the same way each time by using what are commonly called “workflow engines”. Generally, workflow engines are provided code which indicate beginnings (what triggers a workflow to begin), actions (work being performed in sequence), and an ending (where the workflow stops). There are many workflow engines, including some which help accomplish work alongside version control.
-
-
GitHub Actions
-
-
flowchart LR
  subgraph workflow [GitHub Actions Workflow Run]
    direction LR
    action["action(s)"] --> en((end))
    start((event\ntrigger))
  end
  start --> action
  style start fill:#6EE7B7
  style en fill:#FCA5A5
-
-
-
A diagram showing GitHub Actions as a workflow.
-
-
GitHub Actions is a feature of GitHub which allows you to run workflows in relation to your code as a continuous integration (including automated testing, builds, and deployments) and general automation tool. For example, one can use GitHub Actions to make sure code related to a GitHub Pull Request passes certain tests before it is allowed to be merged. GitHub Actions may be specified using YAML files within your repository’s .github/workflows directory by using syntax specific to Github’s workflow specification. Each YAML file under the .github/workflows directory can specify workflows to accomplish tasks related to your software work. GitHub Actions workflows may be customized to your own needs, or use an existing marketplace of already-created Actions.
-
-
-
-
GitHub provides an “Actions” tab for each repository which helps visualize and control Github Actions workflow runs. This tab shows a history of all workflow runs in the repository. For each run, it shows whether it ran successfully or not, the associated logs, and controls to cancel or re-run it.
-
-
-
GitHub Actions Examples
GitHub Actions is sometimes better understood with examples. See the following references for a few basic examples of using GitHub Actions in a simulated project repository.

1. example-action.yml: demonstrates how to run a snippet of Python code in a basic GitHub Actions workflow.
2. run-python-file.yml: demonstrates how to reliably reproduce the environment by installing dependencies using Poetry, and then run a Python file in that environment.
flowchart LR
  subgraph container ["local simulation container(s)"]
    direction LR
    subgraph workflow [GitHub Actions Workflow Run]
      direction LR
      start((event\ntrigger))
      action --> en((end))
    end
  end
  start --> action
  act[Run Act] -.-> |Simulate\ntrigger| start
  style start fill:#6EE7B7
  style en fill:#FCA5A5
-
-
-
A diagram showing how GitHub Actions workflows may be triggered from Act
-
-
One challenge with GitHub Actions is a lack of standardized local testing tools. For example, how will you know that a new GitHub Actions workflow will function as expected (or at all) without pushing to the GitHub repository? One third-party tool which can help with this is Act. Act uses Docker images (which require Docker Desktop) to simulate running a GitHub Actions workflow within your local environment. Using Act can sometimes avoid guessing what will occur when a GitHub Actions workflow is added to your repository. See Act’s installation documentation for more information on getting started with this tool.
-
-
Nested Workflows with GitHub Actions
-
-
flowchart LR
  subgraph action ["Nested Workflow (Dagger, etc)"]
    direction LR
    actions
    start2((start)) --> actions
    actions --> en2((end))
    en2((end))
  end
  subgraph workflow2 [Local Environment Run]
    direction LR
    run2[run workflow]
    en3((end))
    start3((event\ntrigger))
  end
  subgraph workflow [GitHub Actions Workflow Run]
    direction LR
    start((event\ntrigger))
    run[run workflow]
    en((end))
  end

  start --> run
  start3 --> run2
  action -.-> run
  run --> en
  run2 --> en3
  action -.-> run2
  style start fill:#6EE7B7
  style start2 fill:#D1FAE5
  style start3 fill:#6EE7B7
  style en fill:#FCA5A5
  style en2 fill:#FFE4E6
  style en3 fill:#FCA5A5
-
-
-
A diagram showing how GitHub Actions may leverage nested workflows with tools like Dagger.
-
-
There are times when GitHub Actions may be too constricting or Act may not accurately simulate workflows. We also might seek to “write once, run anywhere” (WORA) to enable flexible development on many environments. One workaround to this challenge is to use nested workflows which are compatible with local environments and GitHub Actions environments. Dagger is one tool which enables programmatically specifying and using workflows this way. Using Dagger allows you to trigger workflows on your local machine or GitHub Actions with the same underlying engine, meaning there are fewer inconsistencies or guesswork for developers (see here for an explanation of how Dagger works).
-
-
There are also other alternatives to Dagger you may want to consider based on your usecase, preference, or interest. Earthly is similar to Dagger and uses “earthfiles” as a specification. Both Dagger and Earthly (in addition to GitHub Actions) use container-based approaches, which in-and-of themselves present additional alternatives outside the scope of this article.
-
-
-
GitHub Actions with Nested Workflow Example
Reference this example for a brief demonstration of how GitHub Actions and Dagger may be used together.

4. run-matrixed-pytest-dagger.yml: demonstrates how to run matrixed Python versions for confirming passing pytest tests using GitHub Actions and Dagger together. A GitHub Actions matrix strategy is used to span concurrent work while retaining the reproducibility from Dagger task specification.
-
-
-
-
Closing Remarks
-
-
Using GitHub Actions through the above methods can help automate your technical work and increase the quality of your code with sometimes very little additional effort. Saving time through this form of automation can provide additional flexibility to accomplish more complex work which requires your attention (perhaps using timeboxing techniques). Even small amounts of time saved can turn into large opportunities for other work. On this note, be sure to explore how GitHub Actions can improve things for your software endeavors.
Tip of the Week: Using Python and Anaconda with the Alpine HPC Cluster
Diagram showing common benefits of Alpine and HPC clusters.
-
-
Alpine is a High Performance Compute (HPC) cluster. HPC environments provide shared computer hardware resources like memory, CPU, GPU or others to run performance-intensive work. Reasons for using Alpine might include:
-
-
-
- Compute resources: Leveraging otherwise cost-prohibitive amounts of memory, CPU, GPU, etc. for processing data.
- Long-running jobs: Completing long-running processes which may take hours or days to complete.
- Collaborations: Sharing a single implementation environment for reproducibility within a group (avoiding “works on my machine” inconsistency issues).
-
-
-
How does Alpine work?
-
-
-
-
Diagram showing high-level user workflow and Alpine components.
-
-
Alpine’s compute resources are used through compute nodes in a system called Slurm. Slurm is a system that allows a large number of users to run jobs on a cluster of computers; the system figures out how to use all the computers in the cluster to execute all the users’ jobs fairly (i.e., giving each user approximately equal time and resources on the cluster). A job is a request to run something, e.g. a bash script or a program, along with specifications about how much RAM and CPU it needs, how long it can run, and how it should be executed.
-
-
Slurm’s role in general is to take in a job (submitted via the sbatch command) and put it into a queue (also called a “partition” in Slurm). For each job in the queue, Slurm constantly tries to find a computer in the cluster with enough resources to run that job, then, when an available computer is found, runs the program the job specifies on that computer. As the program runs, Slurm records its output to files and finally reports the program’s exit status (either completed or failed) back to the job manager.
-
-
Importantly, jobs can either be marked as interactive or batch. When you submit an interactive job, sbatch will pause while waiting for the job to start and then connect you to the program, so you can see its output and enter commands in real time. On the other hand, a batch job will return immediately; you can see the progress of your job using squeue, and you can typically see the output of the job in the folder from which you ran sbatch unless you specify otherwise.
Data for or from Slurm work may be stored temporarily on local storage or on user-specific external (remote) storage.
-
-
-
-
-
-
-
Wait, what are “nodes”?
-
-
A simplified way to understand the architecture of Slurm on Alpine is through login and compute “nodes” (computers). Login nodes act as a place to prepare and submit jobs which will be completed on compute nodes. Login nodes are never used to execute Slurm jobs, whereas compute nodes are exclusively accessed via a job. Login nodes have limited resource access and are not recommended for running procedures.
-
-
-
-
-
One can interact with Slurm on Alpine by use of Slurm interfaces and directives. A quick way of accessing Alpine resources is through the use of the acompile command, which starts an interactive job on a compute node with some typical default parameters for the job. Since acompile requests very modest resources (1 hour and 1 CPU core at the time of writing), you’ll typically quickly be connected to a compute node. For more intensive or long-lived interactive jobs, consider using sinteractive, which allows for more customization: Interactive Jobs. One can also access Slurm directly through various commands on Alpine.
Using Alpine effectively involves knowing how to leverage Slurm. A simplified way to understand how Slurm works is through the following sequence. Please note that some steps and additional complexity are omitted for the purposes of providing a basis of understanding.
-
-
-
1. Create a job script: build a script which will configure and run procedures related to the work you seek to accomplish on the HPC cluster.
2. Submit the job to Slurm: ask Slurm to run a set of commands or procedures.
3. Job queue: Slurm will queue the submitted job alongside others (recall that the HPC cluster is a shared resource), providing information about progress as time goes on.
4. Job processing: Slurm will run the procedures in the job script as scheduled.
5. Job completion or cancellation: submitted jobs eventually may reach completion or cancellation states with saved information inside Slurm regarding what happened.
-
-
-
How do I store data on Alpine?
-
-
-
-
Data used or produced by your processed jobs on Alpine may use a number of different data storage locations. Be sure to follow the Acceptable data storage and use policies of Alpine, avoiding the use of certain sensitive information and other items. These may be distinguished in two ways:
-
-
-
-
Alpine local storage (sometimes temporary): Alpine provides a number of temporary data storage locations for accomplishing your work. ⚠️ Note: some of these locations may be periodically purged and are not a suitable location for long-term data hosting (see here for more information)! Storage locations available (see this link for full descriptions):
-
-
-
- Home filesystem: 2 GB of backed up space under /home/$USER (where $USER is your RMACC or Alpine username).
- Projects filesystem: 250 GB of backed up space under /projects/$USER (where $USER is your RMACC or Alpine username).
- Scratch filesystem: 10 TB (10,240 GB) of space which is not backed up under /scratch/alpine/$USER (where $USER is your RMACC or Alpine username).
-
-
-
-
External / remote storage: Users are encouraged to explore external data storage options for long-term hosting. Examples may include the following:
-
-
-
- PetaLibrary: subsidized external storage host from University of Colorado Boulder’s Research Computing (requires specific arrangements outside of Alpine).
- Others: additional options include third-party “storage as a service” offerings like Google Drive or Dropbox and/or external servers maintained by other groups.
-
-
-
-
-
How do I send or receive data on Alpine?
-
-
-
-
Diagram showing external data storage being used to send or receive data on Alpine local storage.
-
-
Data may be sent to or gathered from Alpine using a number of different methods. These may vary contingent on the external data storage being referenced, the code involved, or your group’s available resources. Please reference the following documentation from the University of Colorado Boulder’s Research Computing regarding data transfers: The Compute Environment - Data Transfer. Please note: due to the authentication configuration of Alpine many local or SSH-key based methods are not available for CU Anschutz users. As a result, Globus represents one of the best options available (see 3. 📂 Transfer data results below). While the Globus tutorial in this document describes how you can download data from Alpine to your computer, note that you can also use Globus to transfer data to Alpine from your computer.
-
-
Implementation
-
-
-
-
Diagram showing how an example project repository may be used within Alpine through primary steps and processing workflow.
-
-
Use the following steps to understand how Alpine may be used with an example project repository to run example Python code.
-
-
0. 🔑 Gain Alpine access
-
-
First you will need to gain access to Alpine. This access is provided to members of the University of Colorado Anschutz through RMACC and is separate from other credentials which may be provided by default in your role. Please see the following guide from the University of Colorado Boulder’s Research Computing covering requesting access and generally how this works for members of the University of Colorado Anschutz.
[username@xsede.org@login-ciX ~]$ cd /projects/$USER
[username@xsede.org@login-ciX username@xsede.org]$ git clone https://github.com/CU-DBMI/example-hpc-alpine-python
Cloning into 'example-hpc-alpine-python'...
... git output ...
[username@xsede.org@login-ciX username@xsede.org]$ ls -l example-hpc-alpine-python
... ls output ...
-
-
-
An example of what this preparation section might look like in your Alpine terminal session.
-
-
Next we will prepare our code within Alpine. We do this to balance the fact that we may develop and source control code outside of Alpine. In the case of this example work, we assume git as an interface for GitHub as the source control host.
-
-
Below you’ll find the general steps associated with this process.
1. Change directory into the Projects filesystem (generally we’ll assume processed data produced by this code are large enough to warrant the need for additional space): cd /projects/$USER
2. Use git (built into Alpine by default) commands to clone this repo: git clone https://github.com/CU-DBMI/example-hpc-alpine-python
3. Verify the contents were received as desired (this should show the contents of an example project repository): ls -l example-hpc-alpine-python
-
What if I need to authenticate with GitHub?
-
-
There are times where you may need to authenticate with GitHub in order to accomplish your work. From a GitHub perspective, you will want to use either GitHub Personal Access Tokens (PATs) (recommended by GitHub) or SSH keys associated with the git client on Alpine. Note: if you are prompted for a username and password from git when accessing a GitHub resource, the password is now associated with other keys like PATs instead of your user’s password (reference). See the following guide from GitHub for more information on how authentication through git to GitHub works:
[username@xsede.org@login-ciX ~]$ sbatch --export=CSV_FILEPATH="/projects/$USER/example_data.csv" example-hpc-alpine-python/run_script.sh
[username@xsede.org@login-ciX username@xsede.org]$ tail -f example-hpc-alpine-python.out
... tail output (ctrl/cmd + c to cancel) ...
[username@xsede.org@login-ciX username@xsede.org]$ head -n 2 example_data.csv
... data output ...
-
-
-
An example of what this implementation section might look like in your Alpine terminal session.
-
-
After our code is available on Alpine we’re ready to run it using Slurm and related resources. We use Anaconda to build a Python environment with specified packages for reproducibility. The main goal of the Python code related to this work is to create a CSV file with random data at a specified location. We’ll use Slurm’s sbatch command, which submits batch scripts to Slurm using various options.
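As a rough sketch of what that Python code might look like (hypothetical; see the example repository for the actual implementation), the script could read the target path from the environment variable exported by sbatch and write random data to it:

```python
# hypothetical sketch: write a CSV of random data to a path provided
# through the CSV_FILEPATH environment variable (exported via sbatch)
import os

import numpy as np
import pandas as pd

csv_filepath = os.environ["CSV_FILEPATH"]

# build a small dataframe of random values and write it as a CSV
df = pd.DataFrame(np.random.rand(100, 3), columns=["col_a", "col_b", "col_c"])
df.to_csv(csv_filepath, index=False)
```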
-
-
-
1. Use the sbatch command with exported variable CSV_FILEPATH: sbatch --export=CSV_FILEPATH="/projects/$USER/example_data.csv" example-hpc-alpine-python/run_script.sh
2. After a short moment, use the tail command to observe the log file created by Slurm for this sbatch submission. This file can help you understand where things are at and if anything went wrong: tail -f example-hpc-alpine-python.out
3. Once you see that the work has completed from the log file, take a look at the top 2 lines of the data file using the head command to verify the data arrived as expected (column names with random values): head -n 2 example_data.csv
-
-
-
-
3. 📂 Transfer data results
-
-
-
-
Diagram showing how example_data.csv may be transferred from Alpine to a local machine using Globus solutions.
-
-
Now that the example data output from the Slurm work is available we need to transfer that data to a local system for further use. In this example we’ll use Globus as a data transfer method from Alpine to our local machine. Please note: always be sure to check data privacy and policy, which may change the methods or storage locations you may use for your data!
1. During installation, you will be prompted to login to Globus. Use your ACCESS credentials to login.
2. During installation login, note the label you provide to Globus. This will be used later, referenced as “Globus Connect Personal label”.
3. Ensure you add and (importantly:) provide write access to a local directory via Globus Connect Personal - Preferences - Access where you’d like the data to be received from Alpine to your local machine.
Configure File Manager left side (source selection)

1. Within the Globus web interface on the File Manager tab, use the Collection input box to search or select “CU Boulder Research Computing ACCESS”.
2. Within the Globus web interface on the File Manager tab, use the Path input box to enter: /projects/your_username_here/ (replacing “your_username_here” with your username from Alpine, including the “@” symbol if it applies).
-
-
-
Configure File Manager right side (destination selection)

1. Within the Globus web interface on the File Manager tab, use the Collection input box to search or select the Globus Connect Personal label you provided in earlier steps.
2. Within the Globus web interface on the File Manager tab, use the Path input box to enter the local path which you made accessible in earlier steps.
-
-
-
Begin Globus transfer

1. Within the Globus web interface on the File Manager tab on the left side (source selection), check the box next to the file example_data.csv.
2. Within the Globus web interface on the File Manager tab on the left side (source selection), click the “Start ▶️” button to begin the transfer from Alpine to your local directory.
3. After clicking the “Start ▶️” button, you may see a message in the top right stating “Transfer request submitted successfully”. You can click the link to view the details associated with the transfer.
4. After a short period, the file will be transferred and you should be able to verify the contents on your local machine.
Python packaging is the craft of preparing for and reaching distribution of your Python work to wider audiences. Following conventions for packaging helps your software work become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post will cover preparation aspects of packaging, readying software work for wider distribution.
The practice of Python packaging is similar to that of publishing a book. Consider how a loose bag of text is different from a book. How and why are these things different?
-
-
-
A book has commonly understood sequencing of content (i.e. copyright page, then title page, then body content pages…).
-
A book often cites references and acknowledges other work explicitly.
-
A book undergoes a manufacturing process which allows the text to be received in many places the same way.
-
-
-
-
-
These can be thought of as metaphors when it comes to packaging in Python. Books have a smell which sometimes comes from how they were stored, treated, or maintained. While there are pleasant book smells, they might also smell soggy from being left in the rain or stored without maintenance for too long. Just like books, software can sometimes have negative code smells indicating a lack of care or a less sustainable condition. Following good packaging practices helps to avoid unwanted code smells while increasing development velocity, maintainability of the software through understandability, trustworthiness of the content, and connection to other projects.
-
-
-
-
-
-
-
Note: these techniques can also work just as well for inner source collaboration (private or proprietary development within organizations)! Don’t hesitate to use these on projects which may not be public facing in order to make development and maintenance easier (if only for you).
A Python package is a collection of modules (.py files) that usually include an “initialization file” __init__.py. This post will cover the craft of packaging which can include one or many packages.
Python packaging today generally assumes a specific directory design.
Following this convention generally improves the understandability of your code. We'll cover each of the main elements below.
The README.md file is a markdown file with documentation including project goals and other short notes about installation, development, or usage. The README.md file is akin to a book jacket blurb which quickly tells the audience what the book will be about.
-
The LICENSE.txt file is a text file which indicates licensing details for the project. It often includes information about how it may be used and protects the authors in disputes. The LICENSE.txt file can be thought of like a book’s copyright page. See https://choosealicense.com/ for more details on selecting an open source license.
-
The pyproject.toml file is a Python-specific TOML file which helps organize how the project is used and built for wider distribution. The pyproject.toml file is similar to a book’s table of contents, index, and printing or production specification.
The docs directory is used for in-depth documentation and related documentation build code (for example, when building documentation websites, aka “docsites”). The docs directory includes information similar to a book’s “study guide”, providing content surrounding how to best make use of and understand the content found within.
-
The src directory includes primary source code for use in the project. Python projects generally use a nested package directory with modules and sub-packages. The src directory is like a book’s body or general content (perhaps thinking of modules as chapters or sections of related ideas).
-
The tests directory includes testing code for validating functionality of code found in the src directory. The above follows pytest conventions. The tests directory is for code which acts like a book’s early reviewers or editors, making sure that if you change things in src the impacts remain as expected.
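As a brief sketch of how these directories relate (the package name example_package and its function add_numbers below are hypothetical), a module under src might be exercised by a pytest test under tests:

```python
# src/example_package/operations.py (hypothetical module)
def add_numbers(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b


# tests/test_operations.py (hypothetical pytest test)
# pytest discovers test_*.py files and runs functions named test_*
from example_package.operations import add_numbers


def test_add_numbers():
    # validates functionality of code found in the src directory
    assert add_numbers(1, 2) == 3
```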
-
-
-
Common directory structure examples
-
-
The Python directory structure described above can be seen in the wild in the following resources. These can serve as a great reference for starting or adjusting your own work.
Building an understandable body of content helps tremendously with audience trust. What else can we do to enhance project trust? The following elements can help improve an audience’s trust in packaged Python work.
-
-
Source control authenticity
-
-
-
-
Be authentic! Fill out your profile to help your audience know the author and why you do what you do. See here for GitHub’s documentation on filling out your profile. Doing this may seem irrelevant but can go a long way to making technical work more relatable.
-
-
-
Add a profile picture of yourself or something fun.
-
Set your profile description to information which is both professionally accurate and unique to you.
-
Show or link to work which you feel may be relevant or exciting to those in your audience.
-
-
-
Staying up to date with supported Python releases
-
-
-
-
Use Python versions which are supported (this changes over time).
Python versions which are end-of-life may be difficult to support and are a sign of code decay for projects. Specify the version of Python which is compatible with your project by using environment specifications such as pyproject.toml files and related packaging tools (more on this below).
Staying up to date with supported releases oftentimes can result in performance or other similar benefits (later versions usually include improvements!).
-
-
-
Security linting and visible checks with GitHub Actions
-
-
-
-
Use security vulnerability linters to help prevent undesirable or risky processing for your audience. Doing this is both practical for avoiding issues and conveys that you care about those using your package!
-gitleaks: checks for sensitive passwords, keys, or tokens
-
-
-
-
-
Combining GitHub Actions with security linters and tests from your software validation suite can add an observable ✅ for your project.
This provides the audience with a sense that you're transparently testing and sharing the results of those tests.
Connection: personal and inter-package relationships
-
-
-
-
Understandability and trust set the stage for your project’s connection to other people and projects. What can we do to facilitate connection with our project? Use the following techniques to help enhance your project’s connection to others and their work.
-
-
Acknowledging authors and referenced work with CITATION.cff
-
-
-
-
Add a CITATION.cff file to your project root in order to describe project relationships and acknowledgements in a standardized way. The CFF format is also GitHub compatible, making it easier to cite your project.
-
-
-
This is similar to a book’s credits, acknowledgements, dedication, and author information sections.
Provide a CONTRIBUTING.md file in your project root to make clear the support details, development guidance, code of conduct, and overall documentation surrounding how the project is governed.
Environment management reproducibility as connected project reality
-
-
-
-
Code without an environment specification is difficult to run in a consistent way. This can lead to “works on my machine” scenarios where different things happen for different people, reducing the chance that people can connect with a shared reality for how your code should be used.
-
-
-
“But why do we have to switch the way we do things?”
-We’ve always been switching approaches (software approaches evolve over time)! A brief history of Python environment and packaging tooling:
-
-
-
distutils, easy_install + setup.py (primarily used during the 1990s - early 2000s)

pip, setup.py + requirements.txt (primarily used during the late 2000s - early 2010s)

poetry + pyproject.toml (began use around the late 2010s - ongoing)
-
-
-
-
Using Python poetry for environment and packaging management
-
-
-
-
Poetry is one Pythonic environment and packaging manager which can help increase reproducibility using pyproject.toml files. It's one of several alternatives, alongside tools such as hatch and pipenv.
After installation, Poetry gives us the ability to initialize a directory structure similar to what we presented earlier by using the poetry new ... command. If you’d like a more interactive version of the same, use the poetry init command to fill out various sections of your project with detailed information.
Using the poetry new ... command also initializes the content of our pyproject.toml file with opinionated details (following the recommendation from earlier in the article regarding declared Python version specification).
-
-
-poetry dependency management
-
-
user@machine % poetry add pandas
-
-Creating virtualenv package-name-1STl06GY-py3.9 in /pypoetry/virtualenvs
-Using version ^2.1.0 for pandas
-
-...
-
-Writing lock file
-
-
-
We can add dependencies directly using the poetry add ... command. This command also provides the possibility of using a group flag (for example poetry add pytest --group testing) to help organize and distinguish multiple sets of dependencies.
-
-
-
A local virtual environment is managed for us automatically.
-
-A poetry.lock file is written when the dependencies are installed to help ensure the version you installed today will be what’s used on other machines.
-
The poetry.lock file helps ensure reproducibility when dealing with dependency version ranges (where otherwise we may end up using different versions which match the dependency ranges but observe different results).
-
-
-
Running Python from the context of poetry environments
-
-
% poetry run python -c "import pandas; print(pandas.__version__)"
-
-2.1.0
-
This allows us to quickly run code through the context of the project’s environment.
-
Poetry can automatically switch between multiple environments based on the local directory structure.
-
We can also use the environment as a “shell” (similar to virtualenv's activate) with the poetry shell command, which enables us to leverage a dynamic session in the context of the poetry environment.
Even if we don’t reach wider distribution on PyPI or elsewhere, source code managed by pyproject.toml and poetry can be used for “manual” distribution (with reproducible results) from GitHub repositories. When we’re ready to distribute pre-built packages on other networks we can also use the following:
-
-
% poetry build
-
-Building package-name (0.1.0)
- - Building sdist
- - Built package_name-0.1.0.tar.gz
- - Building wheel
- - Built package_name-0.1.0-py3-none-any.whl
-
-
-
Poetry readies source (sdist) and built (wheel) distributions of our code for platforms like PyPI by using the poetry build command. We'll cover more on these files and distribution steps in a later post!
Tip of the Week: Data Quality Validation through Software Testing Techniques
-
-
-
-
-
-
-
Diagram showing input, in-process data, and output data as a workflow.
-
-
-
Data-oriented software development can benefit from a specialized focus on varying aspects of data quality validation.
-We can use software testing techniques to validate certain qualities of the data in order to meet a declarative standard (where one doesn’t need to guess or rediscover known issues).
-These come in a number of forms and generally follow existing software testing concepts which we’ll expand upon below.
-This article will cover a few tools which leverage these techniques for addressing data quality validation testing.
-
-
Data Quality Testing Concepts
-
-
Hoare Triple
-
-
-
-
One concept we’ll use to present these ideas is Hoare logic, which is a system for reasoning on software correctness.
Hoare logic includes the idea of a Hoare triple ($\{P\}\;C\;\{Q\}$), where $\{P\}$ is a precondition assertion, $C$ is a command, and $\{Q\}$ is a postcondition assertion.
-Software development using data often entails (sometimes assumed) assertions of precondition from data sources, a transformation or command which changes the data, and a (sometimes assumed) assertion of postcondition in a data output or result.
-
-
Design by Contract
-
-
-
-
Data testing through design by contract over Hoare triple.
-
-
Hoare logic and software correctness help describe design by contract (DbC), a software approach involving the formal specification of “contracts” which help ensure we meet our intended goals.
DbC helps describe how to create assertions when proceeding through Hoare triple states for data.
-These concepts provide a framework for thinking about the tools mentioned below.
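As a minimal sketch of these ideas (the DataFrame and its score column below are hypothetical), preconditions and postconditions can be expressed as plain assertions surrounding a command:

```python
import pandas as pd


def double_scores(data: pd.DataFrame) -> pd.DataFrame:
    """A command {C} wrapped in precondition {P} and postcondition {Q} contracts."""
    # {P}: precondition contract - the input includes a "score" column with no missing values
    assert "score" in data.columns, "expected a 'score' column"
    assert data["score"].notna().all(), "expected no missing scores"

    # {C}: the command which transforms the data
    result = data.assign(score=data["score"] * 2)

    # {Q}: postcondition contract - the output keeps scores non-negative
    assert (result["score"] >= 0).all(), "expected non-negative scores"
    return result


double_scores(pd.DataFrame({"score": [1, 2, 3]}))
```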
-
-
Data Component Testing
-
-
-
-
Diagram showing data contracts as generalized and reusable “component” testing being checked through contracts and raising an error if they aren’t met or continuing operations if they are met.
-
-
We often need to verify that the data surrounding a certain component meets minimum standards.
The word “component” is used here from the context of component-based software design to group together reusable, modular qualities of the data where sometimes we don't know (or want) to specify granular aspects (such as schema, type, column name, etc).
These components are often implied by the software which will eventually use the data, which can emit warnings or errors when it finds the data does not meet these standards.
Oftentimes these components are contracts checking postconditions of earlier commands or procedures, ensuring the data we receive is accurate to our intention.
We can avoid downstream surprises by creating contracts for our data which verify the components of a result before it reaches later stages.
-
-
Examples of these data components might include (a minimal sketch of checking these follows the list below):
-
-
-
The dataset has no null values.
-
The dataset has no more than 3 columns.
-
The dataset has a column called numbers which includes numbers in the range of 0-10.
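Using plain pandas assertions (with a small, made-up DataFrame), those example components could be checked like this before reaching for the dedicated tools below:

```python
import pandas as pd

# a small, made-up dataset with a "numbers" column
df = pd.DataFrame({"numbers": [0, 3, 7], "labels": ["a", "b", "c"]})

# the dataset has no null values
assert not df.isna().any().any()

# the dataset has no more than 3 columns
assert len(df.columns) <= 3

# the dataset has a column called "numbers" within the range of 0-10
assert df["numbers"].between(0, 10).all()
```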
-
-
-
Data Component Testing - Great Expectations
-
-
"""
-Example of using Great Expectations
-Referenced with modifications from:
-https://docs.greatexpectations.io/docs/tutorials/quickstart/
-"""
-importgreat_expectationsasgx
-
-# get gx DataContext
-# see: https://docs.greatexpectations.io/docs/terms/data_context
-context=gx.get_context()
-
-# set a context data source
-# see: https://docs.greatexpectations.io/docs/terms/datasource
-validator=context.sources.pandas_default.read_csv(
- "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
-)
-
-# add and save expectations
-# see: https://docs.greatexpectations.io/docs/terms/expectation
-validator.expect_column_values_to_not_be_null("pickup_datetime")
-validator.expect_column_values_to_be_between("passenger_count",auto=True)
-validator.save_expectation_suite()
-
-# checkpoint the context with the validator
-# see: https://docs.greatexpectations.io/docs/terms/checkpoint
-checkpoint=context.add_or_update_checkpoint(
- name="my_quickstart_checkpoint",
- validator=validator,
-)
-
-# gather checkpoint expectation results
-checkpoint_result=checkpoint.run()
-
-# show the checkpoint expectation results
-context.view_validation_result(checkpoint_result)
-
-
-
Example code leveraging Python package Great Expectations to perform various data component contract validation.
-
-
Great Expectations is a Python project which provides data contract testing features through the use of components called “expectations” about the data involved.
These expectations act as a standardized way to define and validate components of the data in the same way across different datasets or projects.
In addition to providing a mechanism for validating data contracts, Great Expectations also provides ways to view validation results, share expectations, and build data documentation.
-See the above example for a quick code reference of how these work.
-
-
Data Component Testing - Assertr
-
-
# Example using the Assertr package
# referenced with modifications from:
# https://docs.ropensci.org/assertr/articles/assertr.html
library(dplyr)
library(assertr)

# set our.data to reference the mtcars dataset
our.data <- mtcars

# simulate an issue in the data for contract specification
our.data$mpg[5] <- our.data$mpg[5] * -1

# use verify to validate that column mpg >= 0
our.data %>%
  verify(mpg >= 0)

# use assert to validate that column mpg is within the bounds of 0 to infinity
our.data %>%
  assert(within_bounds(0, Inf), mpg)
-
-
-
Example code leveraging R package Assertr to perform various data component contract validation.
-
-
Assertr is an R project which provides similar data component assertions in the form of verify, assert, and insist methods (see here for more documentation).
Using Assertr provides similar but more lightweight functionality compared to Great Expectations.
-See the above for an example of how to use it in your projects.
-
-
Data Schema Testing
-
-
-
-
Diagram showing data contracts as more granular specifications via “schema” testing being checked through contracts and raising an error if they aren’t met or continuing operations if they are met.
-
-
Sometimes we need greater specificity than what a data component can offer.
-We can use data schema testing contracts in these cases.
The word “schema” here is used from the context of database schemas, but these specifications often apply well beyond databases alone (including database-like formats such as dataframes).
-While reuse and modularity are more limited with these cases, they can be helpful for efforts where precision is valued or necessary to accomplish your goals.
-It’s worth mentioning that data schema and component testing tools often have many overlaps (meaning you can interchangeably use them to accomplish both tasks).
-
-
Data Schema Testing - Pandera
-
-
"""
-Example of using the Pandera package
-referenced with modifications from:
-https://pandera.readthedocs.io/en/stable/try_pandera.html
-"""
-importpandasaspd
-importpanderaaspa
-frompandera.typingimportDataFrame,Series
-
-
-# define a schema
-classSchema(pa.DataFrameModel):
- item:Series[str]=pa.Field(isin=["apple","orange"],coerce=True)
- price:Series[float]=pa.Field(gt=0,coerce=True)
-
-
-# simulate invalid dataframe
-invalid_data=pd.DataFrame.from_records(
- [{"item":"applee","price":0.5},
- {"item":"orange","price":-1000}]
-)
-
-
-# set a decorator on a function which will
-# check the schema as a precondition
-@pa.check_types(lazy=True)
-defprecondition_transform_data(data:DataFrame[Schema]):
- print("here")
- returndata
-
-
-# precondition schema testing
-try:
- precondition_transform_data(invalid_data)
-exceptpa.errors.SchemaErrorsasschema_excs:
- print(schema_excs)
-
-# inline or implied postcondition schema testing
-try:
- Schema.validate(invalid_data)
-exceptpa.errors.SchemaErrorasschema_exc:
- print(schema_exc)
-
-
-
Example code leveraging Python package Pandera to perform various data schema contract validation.
-
-
DataFrame-like libraries such as Pandas can be verified using schema specification contracts through Pandera (see here for full DataFrame library support).
-Pandera helps define specific columns, column types, and also has some component-like features.
-It leverages a Pythonic class specification, similar to data classes and pydantic models, making it potentially easier to use if you already understand Python and DataFrame-like libraries.
-See the above example for a look into how Pandera may be used.
-
-
Data Schema Testing - JSON Schema
-
-
# Example of using the jsonvalidate R package.
# Referenced with modifications from:
# https://docs.ropensci.org/jsonvalidate/articles/jsonvalidate.html

schema <- '{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Hello World JSON Schema",
  "description": "An example",
  "type": "object",
  "properties": {
    "hello": {
      "description": "Provide a description of the property here",
      "type": "string"
    }
  },
  "required": [
    "hello"
  ]
}'

# create a schema contract for data
validate <- jsonvalidate::json_validator(schema, engine = "ajv")

# validate JSON using schema specification contract and invalid data
validate("{}")

# validate JSON using schema specification contract and valid data
validate('{"hello": "world"}')
-
-
-
JSON Schema provides a vocabulary for validating schema contracts on JSON documents.
-There are several implementations of the vocabulary, including Python package jsonschema, and R package jsonvalidate.
-Using these libraries allows you to define pre- or postcondition data schema contracts for your software work.
-See above for an R based example of using this vocabulary to perform data schema testing.
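For a Python-based equivalent, a minimal sketch using the jsonschema package (mirroring the schema from the R example above) might look like this:

```python
# Example of using the Python jsonschema package
# with a schema similar to the R example above.
from jsonschema import ValidationError, validate

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Hello World JSON Schema",
    "description": "An example",
    "type": "object",
    "properties": {"hello": {"type": "string"}},
    "required": ["hello"],
}

# invalid data raises a ValidationError
try:
    validate(instance={}, schema=schema)
except ValidationError as exc:
    print(exc.message)

# valid data passes without raising
validate(instance={"hello": "world"}, schema=schema)
```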
-
-
Shift-left Data Testing
-
-
-
-
Earlier portions of this article have covered primarily data validation of command side-effects and postconditions.
-This is commonplace in development where data sources usually are provided without the ability to validate their precondition or definition.
-Shift-left testing is a movement which focuses on validating earlier in the lifecycle if and when possible to avoid downstream issues which might occur.
-
-
Shift-left Data Testing - Data Version Control (DVC)
-
-
-
-
Data sources undergoing frequent changes become difficult to use because we oftentimes don’t know when the data is from or what version it might be.
-This information is sometimes added in the form of filename additions or an update datetime column in a table.
-Data Version Control (DVC) is one tool which is specially purposed to address this challenge through source control techniques.
-Data managed by DVC allows software to be built in such a way that version preconditions are validated before reaching data transformations (commands) or postconditions.
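As a small, hedged sketch (the repository URL, file path, and tag below are hypothetical placeholders), DVC's Python API can pin a data version so that the precondition is explicit in code:

```python
# minimal sketch using the DVC Python API to pin a data version
# (the repository URL, path, and rev below are hypothetical placeholders)
import dvc.api

data_text = dvc.api.read(
    path="data/example_data.csv",
    repo="https://github.com/example-org/example-repo",
    rev="v1.0.0",  # a specific tag or commit acts as a version precondition
)

print(data_text[:100])
```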
-
-
Shift-left Data Testing - Flyway
-
-
-
-
Database sources can leverage an idea nicknamed “database as code” (which builds on a similar idea about infrastructure as code) to help declare the schema and other elements of a database in the same way one would code.
-These ideas apply to both databases and also more broadly through DVC mentioned above (among other tools) via the concept “data as code”.
-Implementing this idea has several advantages from source versioning, visibility, and replicability.
-One tool which implements these ideas is Flyway which can manage and implement SQL-based files as part of software data precondition validation.
A lightweight alternative to Flyway is sometimes to include a SQL file which creates the related database objects and doubles as data documentation.
Tip of the Week: Codesgiving - Open-source Contribution Walkthrough
-
-
-
-
-
-
-
-
-
-
-
-
Introduction
-
-
-
-
-
Thanksgiving is a holiday practiced in many countries which focuses on gratitude for good harvests of the preceding year.
-In the United States, we celebrate Thanksgiving on the fourth Thursday of November each year often by eating meals we create together with others.
-This post channels the spirit of Thanksgiving by giving our thanks through code as a “Codesgiving”, acknowledging and creating better software together.
-
-
-
Giving Thanks to Open-source Harvests
-
-
-
-
Part of building software involves the use of code which others have built, maintained, and distributed for a wider audience.
-Using other people’s work often comes in the form of open-source “harvesting” as we find solutions to software challenges we face.
-Examples might include installing and depending upon Python packages from PyPI or R packages from CRAN within your software projects.
-
-
-
“Real generosity toward the future lies in giving all to the present.”
-- Albert Camus
-
-
-
These open-source projects have internal costs which are sometimes invisible to those who consume them.
-Every software project has an implied level of software gardening time costs involved to impede decay, practice continuous improvements, and evolve the work.
-One way to actively share our thanks for the projects we depend on is through applying our time towards code contributions on them.
-
-
Many projects are in need of additional people’s thinking and development time.
-Have you ever noticed something that needs to be fixed or desirable functionality in a project you use?
-Consider adding your contributions to open-source!
-
-
All Contributions Matter
-
-
-
-
Contributing to open-source can come in many forms and contributions don’t need to be gigantic to make an impact.
-Software often involves simplifying complexity.
-Simplification requires many actions beyond solely writing code.
-For example, a short walk outside, a conversation with someone, or a nap can sometimes help us with breakthroughs when it comes to development.
-By the same token, open-source benefits greatly from communications on discussion boards, bug or feature descriptions, or other work that might not be strictly considered “engineering”.
-
-
An Open-source Contribution Approach
-
-
-
-
The troubleshooting process as a workflow involving looped checks for verifying an issue and validating the solution fixes an issue.
-
-
It can feel overwhelming to find a way to contribute to open-source.
Similar to other software methodologies, modularizing your approach can help you progress without being overwhelmed.
-Using a troubleshooting approach like the above can help you break down big challenges into bite-sized chunks.
-Consider each step as a “module” or “section” which needs to be addressed sequentially.
-
-
Embrace a Learning Mindset
-
-
-
“Before you speak ask yourself if what you are going to say is true, is kind, is necessary, is helpful. If the answer is no, maybe what you are about to say should be left unsaid.”
-- Bernard Meltzer
-
-
-
Open-source contributions almost always entail learning of some kind.
-Many contributions happen solely in the form of code and text communications which are easily misinterpreted.
-Assume positive intent and accept input from others while upholding your own ideas to share successful contributions together.
-Prepare yourself by intentionally opening your mind to input from others, even if you’re sure you’re absolutely “right”.
-
-
-
-
-
-
-
Before communicating, be sure to use Bernard Meltzer’s self-checks mentioned above.
-
-
-
Is what I’m about to say true?
-
-
Have I taken time to verify the claims in a way others can replicate or understand?
-
-
-
Is what I’m about to say kind?
-
-
Does my intention and communication channel kindness (and not cruelty)?
-
-
-
Is what I’m about to say necessary?
-
-
Do my words and actions here enable or enhance progress towards a goal (would the outcome be achieved without them)?
-
-
-
Is what I’m about to say helpful?
-
-
How does my communication increase the quality or sustainability of the project (or group)?
-
-
-
-
-
-
-
-
Setting Software Scheduling Expectations
-
-
-
-
-
-
-
-
-
-
-
-
Suggested ratio of time spent by type of work for an open-source contribution.
-
-
-
1/3 planning (~33%)
-
1/6 coding (~16%)
-
1/4 component and system testing (25%)
-
1/4 code review, revisions, and post-actions (25%)
-
-
-
This modified rule of thumb from The Mythical Man Month can assist with how you structure your time for an open-source contribution.
-Notice the emphasis on planning and testing and keep these in mind as you progress (the actual programming time can be small if adequate time has been spent on planning).
-Notably, the original time fractions are modified here with the final quarter of the time spent suggested as code review, revisions, and post-actions.
-Planning for the time expense of the added code review and related elements assists with keeping a learning mindset throughout the process (instead of feeling like the review is a “tack-on” or “optional / supplementary”).
-A good motto to keep in mind throughout this process is Festina lente, or “Make haste, slowly.” (take care to move thoughtfully and as slowly as necessary to do things correctly the first time).
-
-
Planning an Open-source Contribution
-
-
Has the Need Already Been Reported?
-
-
-
-
Be sure to check whether the bug or feature has already been reported somewhere!
-In a way, this is a practice of “Don’t repeat yourself” (DRY) where we attempt to avoid repeating the same block of code (in this case, the “code” can be understood as natural language).
-For example, you can look on GitHub Issues or GitHub Discussions with a search query matching the rough idea of what you’re thinking about.
-You can also use the GitHub search bar to automatically search multiple areas (including Issues, Discussions, Pull Requests, etc.) when you enter a query from the repository homepage.
-If it has been reported already, take a look to see if someone has made a code contribution related to the work already.
-
-
An open discussion or report of the need doesn’t guarantee someone’s already working on a solution.
-If there aren’t yet any code contributions and it doesn’t look like anyone is working on one, consider volunteering to take a further look into the solution and be sure to acknowledge any existing discussions.
-If you’re unsure, it’s always kind to mention your interest in the report and ask for more information.
-
-
Is the Need a Bug or Feature?
-
-
-
-
-
-
-
-
-
-
-
One way to help solidify your thinking and the approach is to consider whether what you’re proposing is a bug or a feature.
-A software bug is considered something which is broken or malfunctioning.
-A software feature is generally considered new functionality or a different way of doing things than what exists today.
-There’s often overlap between these, and sometimes they can inspire branching needs, but individually they usually are more of one than the other.
-If you can’t decide whether your need is a bug or a feature, consider breaking it down into smaller sub-components so they can be more of one or the other.
-Following this strategy will help you communicate the potential for contribution and also clarify the development process (for example, a critical bug might be prioritized differently than a nice-to-have new feature).
-
-
Reporting the Need for Change
-
-
# Using `function_x` with `library_y` causes `exception_z`
-
-## Summary
-
-As a `library_y` research software developer I want to use `function_x`
-for my data so that I can share data for research outcomes.
-
-## Reproducing the error
-
-This error may be seen using Python v3.x on all major OS's using
-the following code snippet:
-...
-
-
-
-
An example of a user story issue report with imagined code example.
-
-
Open-source needs are often best reported through written stories captured within a bug or feature tracking system (such as GitHub Issues) which if possible also include example code or logs.
-One template for reporting issues is through a “user story”.
-A user story typically comes in the form: As a < type of user >, I want < some goal > so that < some reason >. (Mountain Goat Software: User Stories).
-Alongside the story, it can help to add in a snippet of code which exemplifies a problem, new functionality, or a potential adjacent / similar solution.
-As a general principle, be as specific as you can without going overboard.
-Include things like programming language version, operating system, and other system dependencies that might be related.
-
-
Once you have a good written description of the need, be sure to submit it where it can be seen by the relevant development community.
-For GitHub-based work, this is usually a GitHub Issue, but can also entail discussion board posts to gather buy-in or consensus before proceeding.
-In addition to the specifics outlined above, also recall the learning mindset and Bernard Meltzer’s self-checks, taking time to acknowledge especially the potential challenges and already attempted solutions associated with the description (conveying kindness throughout).
-
-
What Happens After You Submit a Bug or Feature Report?
-
-
-
-
When making open-source contributions, sometimes it can also help to mention that you’re interested in resolving the issue through a related pull request and review.
-Oftentimes open-source projects welcome new contributors but may have specific requirements.
-These requirements are usually spelled out within a CONTRIBUTING.md document found somewhere in the repository or the organization level documentation.
-It’s also completely okay to let other contributors build solutions for the issue (like we mentioned before, all contributions matter, including the reporting of bugs or features themselves)!
-
-
Developing and Testing an Open-source Contribution
-
-
Creating a Development Workspace
-
-
-
-
Once ready to develop a solution for the reported need in the open-source project you’ll need a place to version your updates.
-This work generally takes place through version control on focused branches which are named in a way that relates to the focus.
-When working on GitHub, this work also commonly takes place on forked repository copies.
-Using these methods helps isolate your changes from other work that takes place within the project.
-It also can help you track your progress alongside related changes that might take place before you’re able to seek review or code merges.
-
-
Bug or Feature Verification with Test-driven Development
-
-
-
-
-
-
-
One can use a test-driven development approach as numbered steps (Wikipedia).
-
-
-
-
Add or modify a test which checks for a bug fix or feature addition
-
Run all tests (expecting the newly added test content to fail)
-
Write a simple version of code which allows the tests to succeed
-
Verify that all tests now pass
-
Return to step 3, refactoring the code as needed
-
-
-
-
-
-
-
-
If you decide to develop a solution for what you reported, one software strategy which can help you remain focused and objective is test-driven development.
-Using this pattern sets a “cognitive milestone” for you as you develop a solution to what was reported.
-Open-source projects can have many interesting components which could take time and be challenging to understand.
The addition of the test and related development will help keep you goal-oriented without getting lost in the “software forest” of a project.
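A small, hedged sketch of step 1 above might look like the following, where the package, function, and expected behavior are hypothetical stand-ins for a reported bug:

```python
# tests/test_reported_bug.py (hypothetical)
# Step 1: add a test which captures the reported need.
# This test is expected to fail until a fix is developed (steps 2-5).
from example_package.text_utils import normalize_whitespace


def test_normalize_whitespace_collapses_tabs():
    # the reported bug: tab characters were not collapsed into single spaces
    assert normalize_whitespace("a\t\tb") == "a b"
```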
-
-
Prefer Simple Over Complex Changes
-
-
-
…
-Simple is better than complex.
-Complex is better than complicated.
-…
-- PEP 20: The Zen of Python
-
-
-
Further channeling step 3. from test-driven development above, prefer simple changes over more complex ones (recognizing that the absolute simplest can take iteration and thought).
-Some of the best solutions are often the most easily understood ones (where the code addition or changes seem obvious afterwards).
-A “simplest version” of the code can often be more quickly refactored and completed than devising a “perfect” solution the first time.
-Remember, you’ll very likely have the help of a code review before the code is merged (expect to learn more and add changes during review!).
-
-
It might be tempting to address more than one bug or feature at the same time.
-Avoid feature creep as you build solutions - stay focused on the task at hand!
-Take note of things you notice on your journey to address the reported needs.
These can become additional reported bugs or features which could be addressed later.
-Staying focused with your development will save you time, keep your tests constrained, and (theoretically) help reduce the time and complexity of code review.
-
-
Developing a Solution
-
-
-
-
Once you have a test in place for the bug fix or feature addition it’s time to work towards developing a solution.
-If you’ve taken time to accomplish the prior steps before this point you may already have a good idea about how to go about a solution.
-If not, spend some time investigating the technical aspects of a solution, optionally adding this information to the report or discussion content for further review before development.
-Use timeboxing techniques to help make sure the time you spend in development is no more than necessary.
-
-
Code Review, Revisions, and Post-actions
-
-
Pull Requests and Code Review
-
-
When your code and new test(s) are in a good spot it’s time to ask for a code review.
-It might feel tempting to perfect the code.
-Instead, consider whether the code is “good enough” and would benefit from someone else providing feedback.
-Code review takes advantage of a strength of our species: collaborative & multi-perspectival thinking.
-Leverage this in your open-source experience by seeking feedback when things feel “good enough”.
-
-
-
-
-
-
Demonstrating Pareto Principle “vital few” through a small number of changes to achieve 80% of the value associated with the needs.
-
-
One way to understand “good enough” is to assess whether you have reached what the Pareto Principle terms as the “vital few” causes.
-The Pareto Principle states that roughly 80% of consequences come from 20% of causes (the “vital few”).
-What are the 20% changes (for example, as commits) which are required to achieve 80% of the desired intent for development with your open-source contribution?
-When you reach those 20% of the changes, consider opening a pull request to gather more insight about whether those changes will suffice and how the remaining effort might be spent.
-
-
As you go through the process of opening a pull request, be sure to follow the project's CONTRIBUTING.md documentation; each project's guidelines can vary.
-When working on GitHub-based projects, you’ll need to open a pull request on the correct branch (usually upstream main).
-If you used a GitHub issue to help report the issue, mention the issue in the pull request description using the #issue number (for example #123 where the issue link would look like: https://github.com/orgname/reponame/issues/123) reference to help link the work to the reported need.
-This will cause the pull request to show up within the issue and automatically create a link to the issue from the pull request.
-
-
Code Revisions
-
-
-
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
- Antoine de Saint-Exupéry
-
-
-
You may be asked to update your code based on automated code quality checks or reviewer request.
-Treat these with care; embrace learning and remember that this step can take 25% of the total time for the contribution.
-When working on GitHub forks or branches, you can make additional commits directly on the development branch which was used for the pull request.
-If your reviewers requested changes, re-request their review once changes have been made to help let them know the code is ready for another look.
-
-
Post-actions and Tidying Up Afterwards
-
-
-
-
Once the code has been accepted by the reviewers and through potential automated testing suite(s) the content is ready to be merged.
-Oftentimes this work is completed by core maintainers of the project.
-After the code is merged, it’s usually a good idea to clean up your workspace by deleting your development branch and syncing with the upstream repository.
-While it’s up to core maintainers to decide on report closure, typically the reported need content can be closed and might benefit from a comment describing the fix.
-Many of these steps are considered common courtesy but also, importantly, assist in setting you up for your next contributions!
-
-
Concluding Thoughts
-
-
Hopefully the above helps you understand the open-source contribution process better.
-As stated earlier, every little part helps!
-Best wishes on your open-source journey and happy Codesgiving!
-
-
References
-
-
-
Top Image: Französischer Obstgarten zur Erntezeit (Le verger) by Charles-François Daubigny (cropped). (Source: Wikimedia Commons)
Tip of the Week: Python Memory Management and Troubleshooting
-
-
-
-
-
-
-
-
-
-
-
-
Introduction
-
-
-
Have you ever run Python code only to find it taking forever to complete or sometimes abruptly ending with an error like: 123456 Killed or killed (program exited with code: 137)?
-You may have experienced memory resource or management challenges associated with these scenarios.
-This post will cover some computer memory definitions, how Python makes use of computer memory, and share some tools which may help with these types of challenges.
-
-
-
What is Memory?
-
-
Computer Memory
-
-
-
-
Computer memory is a type of computer resource available for use by software on a computer
-
-
Computer memory, also sometimes known as “RAM” (random-access memory) or “dynamic memory”, is a type of resource used by computer software on a computer.
-“Computer memory stores information, such as data and programs for immediate use in the computer. … Main memory operates at a high speed compared to non-memory storage which is slower but less expensive and oftentimes higher in capacity. “ (Wikipedia: Computer memory).
-
-
-
-
Memory Blocks
-
-
A.) All memory blocks available: [ Block ] [ Block ] [ Block ]

B.) Some memory blocks in use: [ Block ] [ Block ] [ Block ]
-
-
-
Practical analogy
-
-
-
C.) You have limited buckets to hold things: 🪣 🪣 🪣

D.) Two buckets are used, the other remains empty: 🪣 🪣 🪣
-
-
-
-
Fixed-size memory blocks may be free or used at various times. They can be thought of like reusable buckets to hold things.
-
-
One way to organize computer memory is through the use of “fixed-size blocks”, also called “blocks”.
-Fixed-size memory blocks are chunks of memory of a certain byte size (usually all the same size).
-Memory blocks may be in use or free at different times.
-
-
-
-
Memory heaps help organize available memory on a computer for specific procedures. Heaps may have one or many memory pools.
-
-
Computer memory blocks may be organized in hierarchical layers to manage memory efficiently or towards a specific purpose.
-One top-level organization model for computer memory is through the use of heaps which help describe chunks of the total memory available on a computer for specific processes.
-These heaps may be private (only available to a specific software process) or shared (available to one or many software processes).
-Heaps are sometimes further segmented into pools which are areas of the heap which can be used for specific purposes.
-
-
Memory Allocator
-
-
-
-
Memory allocators help software reserve and free computer memory resources.
-
-
Memory management is a concept which helps enable the shared use of computer memory to avoid challenges such as memory overuse (where all memory is in use and never shared to other software).
-Computer memory management often occurs through the use of a memory allocator which controls how computer memory resources are used for software.
-Computer software is written to interact with memory allocators to use computer memory.
-Memory allocators may be used manually (with specific directions provided on when and how to use memory resources) or automatically (with an algorithmic approach of some kind).
-The memory allocator usually performs the following actions with memory (in addition to others):
-
-
-
-“Allocation”: computer memory resource reservation (taking memory). This is sometimes also known as “alloc”, or “allocate memory”.
-
-“Deallocation”: computer memory resource freeing (giving back memory for other uses). This is sometimes also known as “free”, or “freeing memory from allocation”.
-
-
-
Garbage Collection
-
-
-
-
Garbage collectors help free computer memory which is no longer referenced by software.
-
-
“Garbage collection (GC)” is used to describe a type of automated memory management.
-“The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced; such memory is called garbage.” (Wikipedia: Garbage collection (computer science)).
-A garbage collector often works in tandem with a memory allocator to help control computer memory resource usage in software development.
-
-
How Does Python Interact with Computer Memory?
-
-
Python Overview
-
-
-
-
A Python interpreter executes Python code and manages memory for Python procedures.
-
-
Python is an interpreted “high-level” programming language (Python: What is Python?).
-Interpreted languages are those which include an “interpreter” which helps execute code written in a particular way (Wikipedia: Interpreter (computing)).
-High-level languages such as Python often remove the requirement for software developers to manually perform memory management (Wikipedia: High-level programming language).
-
-
Python code is executed by a commonly pre-packaged and downloaded binary called the Python interpreter.
The Python interpreter reads Python code and performs memory management as the code is executed.
The CPython interpreter is the most commonly used interpreter for Python, and it is what's used as a reference for the content here.
There are also other interpreters, such as PyPy, Jython, and IronPython, which all handle memory differently than the CPython interpreter.
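As a small sketch, you can check which interpreter implementation is executing your code using the standard library:

```python
import sys

# prints the interpreter implementation name, e.g. "cpython" for CPython
print(sys.implementation.name)

# prints the full interpreter version string in use
print(sys.version)
```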
-
-
Python’s Memory Manager
-
-
-
-
The Python memory manager helps manage memory for Python code executed by the Python interpreter.
-
-
Memory is managed for Python software processes automatically (when unspecified) or manually (when specified) through the Python interpreter.
-The Python memory manager is an abstraction which manages memory for Python software processes through the Python interpreter (Python: Memory Management).
-From a high-level perspective, we assume variables and other operations written in Python will automatically allocate and deallocate memory through the Python interpreter when executed.
Python's memory manager performs its work through various memory allocators and a garbage collector (or as configured with customizations) within a private Python memory heap.
-
-
Python’s Memory Allocators
-
-
-
-
The Python memory manager by default will use pymalloc internally or malloc from the system to allocate computer memory resources.
-
-
The Python memory manager allocates memory for use through memory allocators.
-Python may use one or many memory allocators depending on specifications in Python code and how the Python interpreter is configured (for example, see Python: Memory Management - Default Memory Allocators).
-One way to understand Python memory allocators is through the following distinctions.
-
-
-
-“Python Memory Allocator” (pymalloc)
-The Python interpreter is packaged with a specialized memory allocator called pymalloc.
-“Python has a pymalloc allocator optimized for small objects (smaller or equal to 512 bytes) with a short lifetime.” (Python: Memory Management - The pymalloc allocator).
-Ultimately, pymalloc uses C malloc to implement memory work.
-
-C dynamic memory allocator (malloc)
When pymalloc is disabled or memory requirements exceed pymalloc's constraints, the Python interpreter will directly use a function from the C standard library called malloc.
-When malloc is used by the Python interpreter, it uses the system’s existing implementation of malloc.
-
-
-
-
-
pymalloc makes use of arenas to further organize pools within a computer memory heap.
-
-
It’s important to note that pymalloc adds additional abstractions to how memory is organized through the use of “arenas”.
-These arenas are specific to pymalloc purposes.
pymalloc may be disabled through the use of a special environment variable called PYTHONMALLOC (for example, setting PYTHONMALLOC=malloc to use only malloc).
-This same environment variable may be used with debug settings in order to help troubleshoot in-depth questions.
-
-
Additional Python Memory Allocators
-
-
-
-
Python code and package dependencies may stipulate the use of additional memory allocators, such as mimalloc and jemalloc outside of the Python memory manager.
-
-
Python provides the capability of customizing memory allocation through the use of packages.
-See below for some notable examples of additional memory allocation possibilities.
-
-
-
-NumPy Memory Allocation
NumPy uses custom C-APIs which are backed by C dynamic memory allocation functions (alloc, free, realloc) to help address memory management.
-These interfaces can be controlled directly through NumPy to help manage memory effectively when using the package.
-
-PyArrow Memory Allocators
-PyArrow provides the capability to use malloc, jemalloc, or mimalloc through the PyArrow Memory Pools group of functions.
A default memory allocator is selected for use with PyArrow based on the operating system and the availability of the memory allocator on that system.
-The selection of a memory allocator for use with PyArrow can be influenced by how it performs on a particular system.
-
-
-
Python Reference Counting
-
-
-
-
| Processed line of code | Reference count |
|---|---|
| `a_string = "cornucopia"` | `a_string`: 1 |
| `reference_a_string = a_string` | `a_string`: 2 (because `a_string` is now referenced twice) |
| `del reference_a_string` | `a_string`: 1 (because the additional reference has been deleted) |
-
-_Python reference counting at a simple level works through the use of object reference increments and decrements._
-
-As computer memory is allocated to Python processes the Python memory manager keeps track of these through the use of a [reference counter](https://en.wikipedia.org/wiki/Reference_counting).
-In Python, we could label this as an "Object reference counter" because all data in Python is represented by objects ([Python: Data model](https://docs.python.org/3/reference/datamodel.html#objects-values-and-types)).
-"... CPython counts how many different places there are that have a reference to an object. Such a place could be another object, or a global (or static) C variable, or a local variable in some C function." ([Python Developer's Guide: Garbage collector design](https://devguide.python.org/internals/garbage-collector/)).
-
-### Python's Garbage Collection
-
-
-
-_The Python garbage collector works as part of the Python memory manager to free memory which is no longer needed (based on reference count)._
-
-Python by default uses an optional garbage collector to automatically deallocate garbage memory through the Python interpreter in CPython.
-"When an object’s reference count becomes zero, the object is deallocated." ([Python Developer's Guide: Garbage collector design](https://devguide.python.org/internals/garbage-collector/))
-Python's garbage collector focuses on collecting garbage created by `pymalloc`, C memory functions, as well as other memory allocators like `mimalloc` and `jemalloc`.
-
-## Python Tools for Observing Memory Behavior
-
-### Python Built-in Tools
-
-```python
-import gc
-import sys
-
-# set gc in debug mode for detecting memory leaks
-gc.set_debug(gc.DEBUG_LEAK)
-
-# create an int object
-an_object = 1
-
-# show the number of uncollectable references via COLLECTED
-COLLECTED = gc.collect()
-print(f"Uncollectable garbage references: {COLLECTED}")
-
-# show the reference count for an object
-print(f"Reference count of `an_object`: {sys.getrefcount(an_object)}")
-```
-
-The [`gc` module](https://docs.python.org/3/library/gc.html) provides an interface to the Python garbage collector.
-In addition, the [`sys` module](https://docs.python.org/3/library/sys.html) provides many functions which provide information about references and other details about Python objects as they are executed through the interpreter.
-These functions and other packages can help software developers observe memory behaviors within Python procedures.
-
-### Python Package: Scalene
-
-
-
-
-[Scalene](https://github.com/plasma-umass/scalene) is a Python package for analyzing memory, CPU, and GPU resource consumption.
-It provides [a web interface](https://github.com/plasma-umass/scalene?tab=readme-ov-file#web-based-gui) to help visualize and understand how resources are consumed.
-Scalene provides suggestions on which portions of your code to troubleshoot through the web interface.
Scalene can also be configured to work with [OpenAI](https://en.wikipedia.org/wiki/OpenAI) [LLMs](https://en.wikipedia.org/wiki/Large_language_model) by way of [an OpenAI API key provided by the user](https://github.com/plasma-umass/scalene?tab=readme-ov-file#ai-powered-optimization-suggestions).
-
-### Python Package: Memray
-
-
-
-
-[Memray](https://github.com/bloomberg/memray) is a Python package to track memory allocation within Python and compiled extension modules.
Memray provides a high-level way to investigate memory performance and adds visualizations such as [flamegraphs](https://www.brendangregg.com/flamegraphs.html) (which contextualize [stack traces](https://en.wikipedia.org/wiki/Stack_trace) and memory allocations in one spot).
-Memray seeks to provide a way to overcome challenges with tracking and understanding Python and other memory allocators (such as C, C++, or Rust libraries used in tandem with a Python process).
-
-## Concluding Thoughts
-
It's worth mentioning that this article covers only a small fraction of what computer memory is and how Python makes use of it.
-Hopefully it clarifies the process and provides a way to get started with investigating memory within the software you work with.
-Wishing you the very best in your software journey with memory!
-
Our primary focus is creating high quality software and maintaining existing software.
-We have a diverse team with a wide range of experience and expertise in software projects related to data-science, biology, medicine, statistics, and machine learning.
-
-
We can take a lab’s ideas and scientific work, and turn them into a fully-realized, complete package of software, for both experts and lay-persons alike, that enables exploration of data, dissemination of knowledge, collaboration, advanced analyses, new insights, or lots more you could imagine.
-
-
Some of the things we do are:
-
-
-
Modern, responsive web applications for data scientists, biologists, and more
-
Powerful and flexible backend systems and APIs for programmers
-
Interactive data visualizations for insight and engagement
-
Research and mockups for rapid iteration and to ensure value before developing
-
Efficient and reproducible pipelines for analyzing large data
-
Scalable and robust cloud infrastructure
-
Beautiful websites with cohesive branding and graphics
-
Automations for processing, deployment, testing, monitoring, etc. that can save time, money, and hassle
-
Code reviews for ensuring reliability, maintainability, and best practices
-
Advising on software-related grant writing
-
Software-focused technical support
-
Workshops, training, teaching, etc.
-
-
-
But the best way to understand the things we do is by looking at the code and using the software yourself.
Whenever we can, we like to share our knowledge and skills to others.
-We believe this benefits the community we operate in and allows us to create better software together.
-
-
On this website, we have a blog where we occasionally post tips, tricks, and other insights related to Git, workflows, code quality, and more.
-
-
We have given workshops and personalized lessons related to Docker, cloud services, and more.
-We’re always happy to set up a session to discuss technical trade whenever someone has the need.
-
-
Scope of our work
-
-
Being central to the department, and not strictly associated with any particular lab or group within it, we need to ensure that we divide up our time and effort fairly.
-While we can do things like build full-stack apps from scratch and maintain complex infrastructure, the projects we take on tend to be small to medium size so that we leave ourselves available to others who need our help.
-Certain projects that are very large and long term in scope, such as ones that need to be HIPAA compliant, will fall outside of our purview and might lead you to hire a dedicated developer to fill your needs.
-That said, we can still provide partial support as a consulting body, a repository of information, a hiring advisor, and more.
-
-
Contact
-
-
Request Support
-
-
Start here to establish a project and work with us.
Schedule a meeting with us about an established project.
-If you haven’t met with us yet on this particular project, please start by requesting support above.
In the notes field, please specify which team members are optional/required for this meeting.
-Also list any additional emails, and we’ll forward the calendar invite to them.
-
-
Chat
-
-
For general questions or technical help, we also have weekly open-office hours, Thursdays at 2:00 PM Mountain Time in the following Zoom room.
-Feel free to stop by!
-
-
-
-
-
-
You can also come to the Zoom room if you’re unsure about something with the support request process mentioned above.
-
-Have you ever run Python code only to find it taking forever to complete or sometimes abruptly ending with an error like: 123456 Killed or killed (program exited with code: 137)?
-You may have experienced memory resource or management challenges associated with these scenarios.
-This post will cover some computer memory definitions, how Python makes use of computer memory, and share some tools which may help with these types of challenges.
-
-
-
-Thanksgiving is a holiday practiced in many countries which focuses on gratitude for good harvests of the preceding year.
-In the United States, we celebrate Thanksgiving on the fourth Thursday of November each year often by eating meals we create together with others.
-This post channels the spirit of Thanksgiving by giving our thanks through code as a “Codesgiving”, acknowledging and creating better software together.
-
-
-
-Data-oriented software development can benefit from a specialized focus on varying aspects of data quality validation.
-We can use software testing techniques to validate certain qualities of the data in order to meet a declarative standard (where one doesn’t need to guess or rediscover known issues).
-These come in a number of forms and generally follow existing software testing concepts which we’ll expand upon below.
-This article will cover a few tools which leverage these techniques for addressing data quality validation testing.
-
-
-
-
-Python packaging is the craft of preparing for and reaching distribution of your Python work to wider audiences. Following conventions for packaging helps your software work become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post will cover preparation aspects of packaging, readying software work for wider distribution.
-
-
-
-
-
-This post is intended to help demonstrate the use of Python on Alpine, a High Performance Compute (HPC) cluster hosted by the University of Colorado Boulder’s Research Computing.
-We use Python here by way of Anaconda environment management to run code on Alpine.
-This readme will cover a background on the technologies and how to use the contents of an example project repository as though it were a project you were working on and wanting to run on Alpine.
-
-
-
-
-
-There are many routine tasks which can be automated to help save time and increase reproducibility in software development. GitHub Actions provides one way to accomplish these tasks using code-based workflows and related workflow implementations. This type of automation is commonly used to perform tests, builds (preparing for the delivery of the code), or delivery itself (sending the code or related artifacts where they will be used).
-
-
-
-
-
-Git provides a feature called branching which facilitates parallel and segmented programming work through commits with version control. Using branching enables both work concurrency (multiple people working on the same repository at the same time) as well as a chance to isolate and review specific programming tasks. This article covers some conceptual best practices with branching, reviewing, and merging code using GitHub.
-
-
-
-
-
-This article covers using the software technique of linting on R code in order to improve code quality, development velocity, and collaboration.
-
-
-
-
-
-Programming often involves long periods of problem solving which can sometimes lead to unproductive or exhausting outcomes. This article covers one way to avoid unproductive time expense and protect yourself from overexhaustion through a technique called “timeboxing” (also sometimes referenced as “timeblocking”).
-
-
-
-
-
-Software documentation is sometimes treated as a less important or secondary aspect of software development. Treating documentation as code allows developers to version control the shared understanding and knowledge surrounding a project. Leveraging this paradigm also enables the use of tools and patterns which have been used to strengthen code maintenance. This article covers one such pattern: linting, or static analysis, for documentation treated like code.
-
-
-
-
-
-The act of creating software often involves many iterations of writing, personal collaborations, and testing. During this process it’s common to lose awareness of code which is no longer used, and thus may not be tested or otherwise linted. Unused code may contribute to “software decay”, the gradual diminishment of code quality or functionality. This post will cover software decay and strategies for addressing unused code to help keep your code quality high.
-
-
-
-
-
-Apache Arrow is a language-independent and high performance data format useful in many scenarios. DuckDB is an in-process SQL-based data management system which is Arrow-compatible. In addition to providing a SQLite-like database format, DuckDB also provides a standardized and high performance way to work with Arrow data where otherwise one may be forced to use language-specific data structures or transforms.
-
-
-
-
-
-Diagrams can be a useful way to illuminate and communicate ideas. Free-form drawing or drag and drop tools are one common way to create diagrams. With this tip of the week we introduce another option: diagrams as code (DaC), or creating diagrams by using code.
-
-
-
-
-
-Have you ever found yourself spending hours formatting your code so it looks just right? Have you ever caught a duplicative import statement in your code? We recommend using open source linting tools to help avoid common issues like these and save time.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
diff --git a/preview/pr-29/feed.xml b/preview/pr-29/feed.xml
deleted file mode 100644
index cc92cff82d..0000000000
--- a/preview/pr-29/feed.xml
+++ /dev/null
@@ -1,2530 +0,0 @@
-Jekyll2024-01-25T20:56:54+00:00/set-website/preview/pr-29/feed.xmlSoftware Engineering TeamThe software engineering team of the Department of Biomedical Informatics at the University of Colorado AnschutzTip of the Month: Python Memory Management and Troubleshooting2024-01-22T00:00:00+00:002024-01-25T20:55:52+00:00/set-website/preview/pr-29/2024/01/22/Python-Memory-Management-and-TroubleshootingTip of the Week: Python Memory Management and Troubleshooting
-
-
-
-
-
-
-
Each month we seek to provide a software tip of the month geared towards helping you achieve your software goals. Views
-expressed in the content belong to the content creators and not the organization, its affiliates, or employees. If you
-have any software questions or suggestions for an upcoming tip of the week, please don’t hesitate to reach out!
-
-
-
-
-
Introduction
-
-
-
Have you ever run Python code only to find it taking forever to complete or sometimes abruptly ending with an error like: 123456 Killed or killed (program exited with code: 137)?
-You may have experienced memory resource or management challenges associated with these scenarios.
-This post will cover some computer memory definitions, how Python makes use of computer memory, and share some tools which may help with these types of challenges.
-
-
-
What is Memory?
-
-
Computer Memory
-
-
-
-
Computer memory is a type of computer resource available for use by software on a computer
-
-
-Computer memory, also sometimes known as “RAM” (random-access memory) or “dynamic memory”, is a type of resource used by computer software on a computer.
-“Computer memory stores information, such as data and programs for immediate use in the computer. … Main memory operates at a high speed compared to non-memory storage which is slower but less expensive and oftentimes higher in capacity. “ (Wikipedia: Computer memory).
-
-
-
Memory Blocks
-
-
-
-A.) All memory blocks available: [Block] [Block] [Block]
-
-B.) Some memory blocks in use: [Block] [Block] [Block]
-
-Practical analogy
-
-C.) You have limited buckets to hold things: 🪣 🪣 🪣
-
-D.) Two buckets are used, the other remains empty: 🪣 🪣 🪣
-
-
Fixed-size memory blocks may be free or used at various times. They can be thought of like reusable buckets to hold things.
-
-
One way to organize computer memory is through the use of “fixed-size blocks”, also called “blocks”.
-Fixed-size memory blocks are chunks of memory of a certain byte size (usually all the same size).
-Memory blocks may be in use or free at different times.
-
-
-
-
Memory heaps help organize available memory on a computer for specific procedures. Heaps may have one or many memory pools.
-
-
Computer memory blocks may be organized in hierarchical layers to manage memory efficiently or towards a specific purpose.
-One top-level organization model for computer memory is through the use of heaps which help describe chunks of the total memory available on a computer for specific processes.
-These heaps may be private (only available to a specific software process) or shared (available to one or many software processes).
-Heaps are sometimes further segmented into pools which are areas of the heap which can be used for specific purposes.
-
-
Memory Allocator
-
-
-
-
Memory allocators help software reserve and free computer memory resources.
-
-
-Memory management is a concept which helps enable the shared use of computer memory to avoid challenges such as memory overuse (where all memory is in use and never released for other software to use).
-Computer memory management often occurs through the use of a memory allocator which controls how computer memory resources are used for software.
-Computer software is written to interact with memory allocators to use computer memory.
-Memory allocators may be used manually (with specific directions provided on when and how to use memory resources) or automatically (with an algorithmic approach of some kind).
-The memory allocator usually performs the following actions with memory (in addition to others):
-
-
-
“Allocation”: computer memory resource reservation (taking memory). This is sometimes also known as “alloc”, or “allocate memory”.
-
“Deallocation”: computer memory resource freeing (giving back memory for other uses). This is sometimes also known as “free”, or “freeing memory from allocation”.
-
-
-
Garbage Collection
-
-
-
-
Garbage collectors help free computer memory which is no longer referenced by software.
-
-
“Garbage collection (GC)” is used to describe a type of automated memory management.
-“The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced; such memory is called garbage.” (Wikipedia: Garbage collection (computer science)).
-A garbage collector often works in tandem with a memory allocator to help control computer memory resource usage in software development.
-
-
How Does Python Interact with Computer Memory?
-
-
Python Overview
-
-
-
-
A Python interpreter executes Python code and manages memory for Python procedures.
-
-
Python is an interpreted “high-level” programming language (Python: What is Python?).
-Interpreted languages are those which include an “interpreter” which helps execute code written in a particular way (Wikipedia: Interpreter (computing)).
-High-level languages such as Python often remove the requirement for software developers to manually perform memory management (Wikipedia: High-level programming language).
-
-
-Python code is executed by a commonly pre-packaged and downloaded binary called the Python interpreter.
-The Python interpreter reads Python code and performs memory management as the code is executed.
-The CPython interpreter is the most commonly used interpreter for Python, and it is what’s used as the reference for other content here.
-There are also other interpreters such as PyPy, Jython, and IronPython which all handle memory differently than the CPython interpreter.
-
-
Python’s Memory Manager
-
-
-
-
The Python memory manager helps manage memory for Python code executed by the Python interpreter.
-
-
-Memory is managed for Python software processes automatically (when unspecified) or manually (when specified) through the Python interpreter.
-The Python memory manager is an abstraction which manages memory for Python software processes through the Python interpreter (Python: Memory Management).
-From a high-level perspective, we assume variables and other operations written in Python will automatically allocate and deallocate memory through the Python interpreter when executed.
-Python’s memory manager performs work through various memory allocators and a garbage collector (or as configured with customizations) within a private Python memory heap.
-
-
Python’s Memory Allocators
-
-
-
-
The Python memory manager by default will use pymalloc internally or malloc from the system to allocate computer memory resources.
-
-
The Python memory manager allocates memory for use through memory allocators.
-Python may use one or many memory allocators depending on specifications in Python code and how the Python interpreter is configured (for example, see Python: Memory Management - Default Memory Allocators).
-One way to understand Python memory allocators is through the following distinctions.
-
-
-
“Python Memory Allocator” (pymalloc)
-The Python interpreter is packaged with a specialized memory allocator called pymalloc.
-“Python has a pymalloc allocator optimized for small objects (smaller or equal to 512 bytes) with a short lifetime.” (Python: Memory Management - The pymalloc allocator).
-Ultimately, pymalloc uses C malloc to implement memory work.
-
C dynamic memory allocator (malloc)
-When pymalloc is disabled or memory requirements exceed pymalloc’s constraints, the Python interpreter will directly use a function from the C standard library called malloc.
-When malloc is used by the Python interpreter, it uses the system’s existing implementation of malloc.
-
-
-
-
-
pymalloc makes use of arenas to further organize pools within a computer memory heap.
-
-
It’s important to note that pymalloc adds additional abstractions to how memory is organized through the use of “arenas”.
-These arenas are specific to pymalloc purposes.
-pymalloc may be disabled through the use of a special environment variable called PYTHONMALLOC (for example, to use only malloc as seen below).
-This same environment variable may be used with debug settings in order to help troubleshoot in-depth questions.
-
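-As a minimal sketch of the idea (the script name your_script.py below is a hypothetical placeholder), PYTHONMALLOC is set in the shell before the interpreter starts, and CPython's sys._debugmallocstats() can report low-level allocator statistics from within a running process:
-
-```python
-# PYTHONMALLOC must be set before the Python interpreter starts, for example from a shell:
-#   PYTHONMALLOC=malloc python your_script.py   # use only the system malloc
-#   PYTHONMALLOC=debug python your_script.py    # add debug hooks for troubleshooting
-import sys
-
-# print low-level allocator statistics for the running CPython interpreter (written to stderr)
-sys._debugmallocstats()
-```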
-
Additional Python Memory Allocators
-
-
-
-
Python code and package dependencies may stipulate the use of additional memory allocators, such as mimalloc and jemalloc outside of the Python memory manager.
-
-
Python provides the capability of customizing memory allocation through the use of packages.
-See below for some notable examples of additional memory allocation possibilities.
-
-
-
NumPy Memory Allocation
-NumPy uses custom C-APIs which are backed by C dynamic memory allocation functions (alloc, free, realloc) to help address memory management.
-These interfaces can be controlled directly through NumPy to help manage memory effectively when using the package.
-
PyArrow Memory Allocators
-PyArrow provides the capability to use malloc, jemalloc, or mimalloc through the PyArrow Memory Pools group of functions.
-A default memory allocator is selected for use by PyArrow based on the operating system and the availability of the memory allocator on the system.
-The selection of a memory allocator for use with PyArrow can be influenced by how it performs on a particular system (see the sketch below for one way to inspect the default pool).
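-
-As a minimal sketch (not from the original article), the following shows how to check which allocator backs PyArrow's default memory pool and how much memory it has allocated:
-
-```python
-import pyarrow as pa
-
-# the default memory pool is chosen by PyArrow based on the system and build
-pool = pa.default_memory_pool()
-print(f"Default allocator backend: {pool.backend_name}")
-
-# allocate some Arrow data and observe the pool's bookkeeping
-array = pa.array(range(1_000_000))
-print(f"Bytes currently allocated: {pool.bytes_allocated()}")
-print(f"Total bytes allocated by PyArrow: {pa.total_allocated_bytes()}")
-```
-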
-a_string: 2
-(Because `a_string` is now referenced twice.)
-
-```python
-del reference_a_string
-```
-
-a_string: 1
-(Because the additional reference has been deleted.)
-
-_Python reference counting at a simple level works through the use of object reference increments and decrements._
-
-As computer memory is allocated to Python processes the Python memory manager keeps track of these through the use of a [reference counter](https://en.wikipedia.org/wiki/Reference_counting).
-In Python, we could label this as an "Object reference counter" because all data in Python is represented by objects ([Python: Data model](https://docs.python.org/3/reference/datamodel.html#objects-values-and-types)).
-"... CPython counts how many different places there are that have a reference to an object. Such a place could be another object, or a global (or static) C variable, or a local variable in some C function." ([Python Developer's Guide: Garbage collector design](https://devguide.python.org/internals/garbage-collector/)).
-
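-As a minimal sketch of the idea (the variable names here are illustrative, not from the original article), `sys.getrefcount` can be used to watch a reference count change as references are added and removed:
-
-```python
-import sys
-
-# create an object (a list is used because small ints and interned strings
-# can have unusually large or fixed reference counts in CPython)
-a_list = ["memory", "blocks"]
-
-# note: sys.getrefcount reports one extra reference for its own argument
-print(sys.getrefcount(a_list))  # typically 2: `a_list` plus the function argument
-
-# add a second reference, then remove it again
-reference_a_list = a_list
-print(sys.getrefcount(a_list))  # typically 3
-
-del reference_a_list
-print(sys.getrefcount(a_list))  # back to 2
-```
-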
-### Python's Garbage Collection
-
-
-
-_The Python garbage collector works as part of the Python memory manager to free memory which is no longer needed (based on reference count)._
-
-Python by default uses an optional garbage collector to automatically deallocate garbage memory through the Python interpreter in CPython.
-"When an object’s reference count becomes zero, the object is deallocated." ([Python Developer's Guide: Garbage collector design](https://devguide.python.org/internals/garbage-collector/))
-Python's garbage collector focuses on collecting garbage created by `pymalloc`, C memory functions, as well as other memory allocators like `mimalloc` and `jemalloc`.
-
-## Python Tools for Observing Memory Behavior
-
-### Python Built-in Tools
-
-```python
-import gc
-import sys
-
-# set gc in debug mode for detecting memory leaks
-gc.set_debug(gc.DEBUG_LEAK)
-
-# create an int object
-an_object = 1
-
-# show the number of uncollectable references via COLLECTED
-COLLECTED = gc.collect()
-print(f"Uncollectable garbage references: {COLLECTED}")
-
-# show the reference count for an object
-print(f"Reference count of `an_object`: {sys.getrefcount(an_object)}")
-```
-
-The [`gc` module](https://docs.python.org/3/library/gc.html) provides an interface to the Python garbage collector.
-In addition, the [`sys` module](https://docs.python.org/3/library/sys.html) provides many functions which provide information about references and other details about Python objects as they are executed through the interpreter.
-These functions and other packages can help software developers observe memory behaviors within Python procedures.
-
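-The standard library's tracemalloc module is another built-in option. As a minimal sketch (not from the original article), it can report peak memory usage and which lines of a program allocated the most memory:
-
-```python
-import tracemalloc
-
-# begin tracing Python memory allocations
-tracemalloc.start()
-
-# perform some work which allocates memory
-data = [str(i) * 10 for i in range(100_000)]
-
-# report current and peak traced memory usage (in bytes)
-current, peak = tracemalloc.get_traced_memory()
-print(f"current: {current}, peak: {peak}")
-
-# show the top 3 allocation sites by line number
-snapshot = tracemalloc.take_snapshot()
-for stat in snapshot.statistics("lineno")[:3]:
-    print(stat)
-
-tracemalloc.stop()
-```
-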
-### Python Package: Scalene
-
-
-
-
-[Scalene](https://github.com/plasma-umass/scalene) is a Python package for analyzing memory, CPU, and GPU resource consumption.
-It provides [a web interface](https://github.com/plasma-umass/scalene?tab=readme-ov-file#web-based-gui) to help visualize and understand how resources are consumed.
-Scalene provides suggestions on which portions of your code to troubleshoot through the web interface.
-Scalene can also be configured to work with [OpenAI](https://en.wikipedia.org/wiki/OpenAI) [LLMs](https://en.wikipedia.org/wiki/Large_language_model) by way of an [OpenAI API key provided by the user](https://github.com/plasma-umass/scalene?tab=readme-ov-file#ai-powered-optimization-suggestions).
-
-### Python Package: Memray
-
-
-
-
-[Memray](https://github.com/bloomberg/memray) is a Python package to track memory allocation within Python and compiled extension modules.
-Memray provides a high-level way to investigate memory performance and adds visualizations such as [flamegraphs](https://www.brendangregg.com/flamegraphs.html) (which contextualize [stack traces](https://en.wikipedia.org/wiki/Stack_trace) and memory allocations in one spot).
-Memray seeks to provide a way to overcome challenges with tracking and understanding Python and other memory allocators (such as C, C++, or Rust libraries used in tandem with a Python process).
-
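-As a minimal sketch of the idea (the capture file name is an arbitrary example), Memray's Tracker API can record allocations made inside a block of Python code, and the resulting file can then be rendered with the memray command line tool:
-
-```python
-from memray import Tracker
-
-# record all allocations made inside this block to a capture file
-with Tracker("memray-example.bin"):
-    data = [str(i) * 10 for i in range(100_000)]
-
-# afterwards, a report can be generated from the capture file, for example:
-#   memray flamegraph memray-example.bin
-```
-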
-## Concluding Thoughts
-
-It's worth mentioning that this article covers only a small fraction of what memory is, how it works, and how Python might make use of it.
-Hopefully it clarifies the process and provides a way to get started with investigating memory within the software you work with.
-Wishing you the very best in your software journey with memory!
-
]]>dave-buntenTip of the Week: Codesgiving - Open-source Contribution Walkthrough2023-11-15T00:00:00+00:002024-01-25T20:55:52+00:00/set-website/preview/pr-29/2023/11/15/Codesgiving-Open-source-Contribution-WalkthroughTip of the Week: Codesgiving - Open-source Contribution Walkthrough
-
-
-
-
-
-
-
-
-
-
-
-
Introduction
-
-
-
-
-
Thanksgiving is a holiday practiced in many countries which focuses on gratitude for good harvests of the preceding year.
-In the United States, we celebrate Thanksgiving on the fourth Thursday of November each year often by eating meals we create together with others.
-This post channels the spirit of Thanksgiving by giving our thanks through code as a “Codesgiving”, acknowledging and creating better software together.
-
-
-
Giving Thanks to Open-source Harvests
-
-
-
-
Part of building software involves the use of code which others have built, maintained, and distributed for a wider audience.
-Using other people’s work often comes in the form of open-source “harvesting” as we find solutions to software challenges we face.
-Examples might include installing and depending upon Python packages from PyPI or R packages from CRAN within your software projects.
-
-
-
“Real generosity toward the future lies in giving all to the present.”
-- Albert Camus
-
-
-
These open-source projects have internal costs which are sometimes invisible to those who consume them.
-Every software project carries an implied cost of software gardening time needed to impede decay, practice continuous improvement, and evolve the work.
-One way to actively share our thanks for the projects we depend on is through applying our time towards code contributions on them.
-
-
Many projects are in need of additional people’s thinking and development time.
-Have you ever noticed something that needs to be fixed or desirable functionality in a project you use?
-Consider adding your contributions to open-source!
-
-
All Contributions Matter
-
-
-
-
Contributing to open-source can come in many forms and contributions don’t need to be gigantic to make an impact.
-Software often involves simplifying complexity.
-Simplification requires many actions beyond solely writing code.
-For example, a short walk outside, a conversation with someone, or a nap can sometimes help us with breakthroughs when it comes to development.
-By the same token, open-source benefits greatly from communications on discussion boards, bug or feature descriptions, or other work that might not be strictly considered “engineering”.
-
-
An Open-source Contribution Approach
-
-
-
-
The troubleshooting process as a workflow involving looped checks for verifying an issue and validating the solution fixes an issue.
-
-
It can feel overwhelming to find a way to contribute to open-source.
-Similar to other software methodology, modularizing your approach can help you progress without being overwhelmed.
-Using a troubleshooting approach like the above can help you break down big challenges into bite-sized chunks.
-Consider each step as a “module” or “section” which needs to be addressed sequentially.
-
-
Embrace a Learning Mindset
-
-
-
“Before you speak ask yourself if what you are going to say is true, is kind, is necessary, is helpful. If the answer is no, maybe what you are about to say should be left unsaid.”
-- Bernard Meltzer
-
-
-
Open-source contributions almost always entail learning of some kind.
-Many contributions happen solely in the form of code and text communications which are easily misinterpreted.
-Assume positive intent and accept input from others while upholding your own ideas to share successful contributions together.
-Prepare yourself by intentionally opening your mind to input from others, even if you’re sure you’re absolutely “right”.
-
-
-
-
-
-
-
Before communicating, be sure to use Bernard Meltzer’s self-checks mentioned above.
-
-
-
Is what I’m about to say true?
-
-
Have I taken time to verify the claims in a way others can replicate or understand?
-
-
-
Is what I’m about to say kind?
-
-
Does my intention and communication channel kindness (and not cruelty)?
-
-
-
Is what I’m about to say necessary?
-
-
Do my words and actions here enable or enhance progress towards a goal (would the outcome be achieved without them)?
-
-
-
Is what I’m about to say helpful?
-
-
How does my communication increase the quality or sustainability of the project (or group)?
-
-
-
-
-
-
-
-
Setting Software Scheduling Expectations
-
-
-
-
-
-
-
-
-
-
-
-
Suggested ratio of time spent by type of work for an open-source contribution.
-
-
-
1/3 planning (~33%)
-
1/6 coding (~16%)
-
1/4 component and system testing (25%)
-
1/4 code review, revisions, and post-actions (25%)
-
-
-
This modified rule of thumb from The Mythical Man Month can assist with how you structure your time for an open-source contribution.
-Notice the emphasis on planning and testing and keep these in mind as you progress (the actual programming time can be small if adequate time has been spent on planning).
-Notably, the original time fractions are modified here with the final quarter of the time spent suggested as code review, revisions, and post-actions.
-Planning for the time expense of the added code review and related elements assists with keeping a learning mindset throughout the process (instead of feeling like the review is a “tack-on” or “optional / supplementary”).
-A good motto to keep in mind throughout this process is Festina lente, or “Make haste, slowly.” (take care to move thoughtfully and as slowly as necessary to do things correctly the first time).
-
-
Planning an Open-source Contribution
-
-
Has the Need Already Been Reported?
-
-
-
-
Be sure to check whether the bug or feature has already been reported somewhere!
-In a way, this is a practice of “Don’t repeat yourself” (DRY) where we attempt to avoid repeating the same block of code (in this case, the “code” can be understood as natural language).
-For example, you can look on GitHub Issues or GitHub Discussions with a search query matching the rough idea of what you’re thinking about.
-You can also use the GitHub search bar to automatically search multiple areas (including Issues, Discussions, Pull Requests, etc.) when you enter a query from the repository homepage.
-If it has been reported already, take a look to see if someone has made a code contribution related to the work already.
-
-
An open discussion or report of the need doesn’t guarantee someone’s already working on a solution.
-If there aren’t yet any code contributions and it doesn’t look like anyone is working on one, consider volunteering to take a further look into the solution and be sure to acknowledge any existing discussions.
-If you’re unsure, it’s always kind to mention your interest in the report and ask for more information.
-
-
Is the Need a Bug or Feature?
-
-
-
-
-
-
-
-
-
-
-
One way to help solidify your thinking and the approach is to consider whether what you’re proposing is a bug or a feature.
-A software bug is considered something which is broken or malfunctioning.
-A software feature is generally considered new functionality or a different way of doing things than what exists today.
-There’s often overlap between these, and sometimes they can inspire branching needs, but individually they usually are more of one than the other.
-If you can’t decide whether your need is a bug or a feature, consider breaking it down into smaller sub-components so they can be more of one or the other.
-Following this strategy will help you communicate the potential for contribution and also clarify the development process (for example, a critical bug might be prioritized differently than a nice-to-have new feature).
-
-
Reporting the Need for Change
-
-
# Using `function_x` with `library_y` causes `exception_z`
-
-## Summary
-
-As a `library_y` research software developer I want to use `function_x`
-for my data so that I can share data for research outcomes.
-
-## Reproducing the error
-
-This error may be seen using Python v3.x on all major OS's using
-the following code snippet:
-...
-
-
-
-
An example of a user story issue report with imagined code example.
-
-
Open-source needs are often best reported through written stories captured within a bug or feature tracking system (such as GitHub Issues) which if possible also include example code or logs.
-One template for reporting issues is through a “user story”.
-A user story typically comes in the form: As a < type of user >, I want < some goal > so that < some reason >. (Mountain Goat Software: User Stories).
-Alongside the story, it can help to add in a snippet of code which exemplifies a problem, new functionality, or a potential adjacent / similar solution.
-As a general principle, be as specific as you can without going overboard.
-Include things like programming language version, operating system, and other system dependencies that might be related.
-
-
Once you have a good written description of the need, be sure to submit it where it can be seen by the relevant development community.
-For GitHub-based work, this is usually a GitHub Issue, but can also entail discussion board posts to gather buy-in or consensus before proceeding.
-In addition to the specifics outlined above, also recall the learning mindset and Bernard Meltzer’s self-checks, taking time to acknowledge especially the potential challenges and already attempted solutions associated with the description (conveying kindness throughout).
-
-
What Happens After You Submit a Bug or Feature Report?
-
-
-
-
When making open-source contributions, sometimes it can also help to mention that you’re interested in resolving the issue through a related pull request and review.
-Oftentimes open-source projects welcome new contributors but may have specific requirements.
-These requirements are usually spelled out within a CONTRIBUTING.md document found somewhere in the repository or the organization level documentation.
-It’s also completely okay to let other contributors build solutions for the issue (like we mentioned before, all contributions matter, including the reporting of bugs or features themselves)!
-
-
Developing and Testing an Open-source Contribution
-
-
Creating a Development Workspace
-
-
-
-
Once ready to develop a solution for the reported need in the open-source project you’ll need a place to version your updates.
-This work generally takes place through version control on focused branches which are named in a way that relates to the focus.
-When working on GitHub, this work also commonly takes place on forked repository copies.
-Using these methods helps isolate your changes from other work that takes place within the project.
-It also can help you track your progress alongside related changes that might take place before you’re able to seek review or code merges.
-
-
Bug or Feature Verification with Test-driven Development
-
-
-
-
-
-
-
One can use a test-driven development approach as numbered steps (Wikipedia).
-
-
-
-
Add or modify a test which checks for a bug fix or feature addition
-
Run all tests (expecting the newly added test content to fail)
-
Write a simple version of code which allows the tests to succeed
-
Verify that all tests now pass
-
Return to step 3, refactoring the code as needed
-
-
-
-
-
-
-
-
If you decide to develop a solution for what you reported, one software strategy which can help you remain focused and objective is test-driven development.
-Using this pattern sets a “cognitive milestone” for you as you develop a solution to what was reported.
-Open-source projects can have many interesting components which could take time and be challenging to understand.
-The addition of the test and related development will help keep you goal-oriented without getting lost in the “software forest” of a project.
-
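-As a minimal, self-contained sketch of the steps above (the file and function names are hypothetical, not from a real project), a pytest-style test is written first to capture the reported need, and the simplest implementation follows to make it pass:
-
-```python
-# test_normalize_score.py
-# A hypothetical bug report says scores should be clamped to the range 0-1.
-
-
-def normalize_score(value: float) -> float:
-    # step 3: the simplest version of code which allows the test to succeed
-    return min(max(value, 0.0), 1.0)
-
-
-def test_normalize_score_clamps_values():
-    # step 1: this test is written first and fails until the fix above exists
-    assert normalize_score(1.7) == 1.0
-    assert normalize_score(-0.2) == 0.0
-    assert normalize_score(0.5) == 0.5
-```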
-
Prefer Simple Over Complex Changes
-
-
-
…
-Simple is better than complex.
-Complex is better than complicated.
-…
-- PEP 20: The Zen of Python
-
-
-
Further channeling step 3. from test-driven development above, prefer simple changes over more complex ones (recognizing that the absolute simplest can take iteration and thought).
-Some of the best solutions are often the most easily understood ones (where the code addition or changes seem obvious afterwards).
-A “simplest version” of the code can often be more quickly refactored and completed than devising a “perfect” solution the first time.
-Remember, you’ll very likely have the help of a code review before the code is merged (expect to learn more and add changes during review!).
-
-
It might be tempting to address more than one bug or feature at the same time.
-Avoid feature creep as you build solutions - stay focused on the task at hand!
-Take note of things you notice on your journey to address the reported needs.
-These can be become additional reported bugs or features which could be addressed later.
-Staying focused with your development will save you time, keep your tests constrained, and (theoretically) help reduce the time and complexity of code review.
-
-
Developing a Solution
-
-
-
-
Once you have a test in place for the bug fix or feature addition it’s time to work towards developing a solution.
-If you’ve taken time to accomplish the prior steps before this point you may already have a good idea about how to go about a solution.
-If not, spend some time investigating the technical aspects of a solution, optionally adding this information to the report or discussion content for further review before development.
-Use timeboxing techniques to help make sure the time you spend in development is no more than necessary.
-
-
Code Review, Revisions, and Post-actions
-
-
Pull Requests and Code Review
-
-
When your code and new test(s) are in a good spot it’s time to ask for a code review.
-It might feel tempting to perfect the code.
-Instead, consider whether the code is “good enough” and would benefit from someone else providing feedback.
-Code review takes advantage of a strength of our species: collaborative & multi-perspectival thinking.
-Leverage this in your open-source experience by seeking feedback when things feel “good enough”.
-
-
-
-
-
-
Demonstrating Pareto Principle “vital few” through a small number of changes to achieve 80% of the value associated with the needs.
-
-
One way to understand “good enough” is to assess whether you have reached what the Pareto Principle terms as the “vital few” causes.
-The Pareto Principle states that roughly 80% of consequences come from 20% of causes (the “vital few”).
-What are the 20% changes (for example, as commits) which are required to achieve 80% of the desired intent for development with your open-source contribution?
-When you reach those 20% of the changes, consider opening a pull request to gather more insight about whether those changes will suffice and how the remaining effort might be spent.
-
-
-As you go through the process of opening a pull request, be sure to follow the open-source CONTRIBUTING.md documentation related to the project; each one can vary.
-When working on GitHub-based projects, you’ll need to open a pull request on the correct branch (usually upstream main).
-If you used a GitHub issue to help report the issue, mention the issue in the pull request description using the #issue number (for example #123 where the issue link would look like: https://github.com/orgname/reponame/issues/123) reference to help link the work to the reported need.
-This will cause the pull request to show up within the issue and automatically create a link to the issue from the pull request.
-
-
Code Revisions
-
-
-
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
-- Antoine de Saint-Exupery
-
-
-
You may be asked to update your code based on automated code quality checks or reviewer request.
-Treat these with care; embrace learning and remember that this step can take 25% of the total time for the contribution.
-When working on GitHub forks or branches, you can make additional commits directly on the development branch which was used for the pull request.
-If your reviewers requested changes, re-request their review once changes have been made to help let them know the code is ready for another look.
-
-
Post-actions and Tidying Up Afterwards
-
-
-
-
Once the code has been accepted by the reviewers and through potential automated testing suite(s) the content is ready to be merged.
-Oftentimes this work is completed by core maintainers of the project.
-After the code is merged, it’s usually a good idea to clean up your workspace by deleting your development branch and syncing with the upstream repository.
-While it’s up to core maintainers to decide on report closure, typically the reported need content can be closed and might benefit from a comment describing the fix.
-Many of these steps are considered common courtesy but also, importantly, assist in setting you up for your next contributions!
-
-
Concluding Thoughts
-
-
Hopefully the above helps you understand the open-source contribution process better.
-As stated earlier, every little part helps!
-Best wishes on your open-source journey and happy Codesgiving!
-
-
References
-
-
-
Top Image: Französischer Obstgarten zur Erntezeit (Le verger) by Charles-François Daubigny (cropped). (Source: Wikimedia Commons)
-
]]>dave-buntenTip of the Week: Data Quality Validation through Software Testing Techniques2023-10-04T00:00:00+00:002024-01-25T20:55:52+00:00/set-website/preview/pr-29/2023/10/04/Data-Quality-ValidationTip of the Week: Data Quality Validation through Software Testing Techniques
-
-
-
-
-
-
-
Diagram showing input, in-process data, and output data as a workflow.
-
-
-
-Data-oriented software development can benefit from a specialized focus on varying aspects of data quality validation.
-We can use software testing techniques to validate certain qualities of the data in order to meet a declarative standard (where one doesn’t need to guess or rediscover known issues).
-These come in a number of forms and generally follow existing software testing concepts which we’ll expand upon below.
-This article will cover a few tools which leverage these techniques for addressing data quality validation testing.
-
-
Data Quality Testing Concepts
-
-
Hoare Triple
-
-
-
-
One concept we’ll use to present these ideas is Hoare logic, which is a system for reasoning on software correctness.
-Hoare logic includes the idea of a Hoare triple ($\{P\}\,C\,\{Q\}$) where $P$ is an assertion of precondition, $C$ is a command, and $Q$ is a postcondition assertion.
-Software development using data often entails (sometimes assumed) assertions of precondition from data sources, a transformation or command which changes the data, and a (sometimes assumed) assertion of postcondition in a data output or result.
-
-
Design by Contract
-
-
-
-
Data testing through design by contract over Hoare triple.
-
-
Hoare logic and Software correctness help describe design by contract (DbC), a software approach involving the formal specification of “contracts” which help ensure we meet our intended goals.
-DbC helps describe how to create assertions when proceeding through Hoare triplet states for data.
-These concepts provide a framework for thinking about the tools mentioned below.
-
-
Data Component Testing
-
-
-
-
Diagram showing data contracts as generalized and reusable “component” testing being checked through contracts and raising an error if they aren’t met or continuing operations if they are met.
-
-
-We often need to verify certain components of the data in order to ensure it meets minimum standards.
-The word “component” is used here from the context of component-based software design to group together reusable, modular qualities of the data where sometimes we don’t know (or want) to specify granular aspects (such as schema, type, column name, etc).
-These components often are implied by software which will eventually use the data, which can emit warnings or errors when they find the data does not meet these standards.
-Oftentimes these components are contracts checking postconditions of earlier commands or procedures, ensuring the data we receive is accurate to our intention.
-We can avoid these challenges by creating contracts for our data to verify the components of the result before it reaches later stages.
-
-
Examples of these data components might include:
-
-
-
The dataset has no null values.
-
The dataset has no more than 3 columns.
-
The dataset has a column called numbers which includes numbers in the range of 0-10.
-
-
-
Data Component Testing - Great Expectations
-
-
"""
-Example of using Great Expectations
-Referenced with modifications from:
-https://docs.greatexpectations.io/docs/tutorials/quickstart/
-"""
-importgreat_expectationsasgx
-
-# get gx DataContext
-# see: https://docs.greatexpectations.io/docs/terms/data_context
-context=gx.get_context()
-
-# set a context data source
-# see: https://docs.greatexpectations.io/docs/terms/datasource
-validator=context.sources.pandas_default.read_csv(
- "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
-)
-
-# add and save expectations
-# see: https://docs.greatexpectations.io/docs/terms/expectation
-validator.expect_column_values_to_not_be_null("pickup_datetime")
-validator.expect_column_values_to_be_between("passenger_count",auto=True)
-validator.save_expectation_suite()
-
-# checkpoint the context with the validator
-# see: https://docs.greatexpectations.io/docs/terms/checkpoint
-checkpoint=context.add_or_update_checkpoint(
- name="my_quickstart_checkpoint",
- validator=validator,
-)
-
-# gather checkpoint expectation results
-checkpoint_result=checkpoint.run()
-
-# show the checkpoint expectation results
-context.view_validation_result(checkpoint_result)
-
-
-
Example code leveraging Python package Great Expectations to perform various data component contract validation.
-
-
-Great Expectations is a Python project which provides data contract testing features through the use of components called “expectations” about the data involved.
-These expectations act as a standardized way to define and validate the components of the data in the same way across different datasets or projects.
-In addition to providing a mechanism for validating data contracts, Great Expectations also provides a way to view validation results, share expectations, and build data documentation.
-See the above example for a quick code reference of how these work.
-
-
Data Component Testing - Assertr
-
-
# Example using the Assertr package
-# referenced with modifications from:
-# https://docs.ropensci.org/assertr/articles/assertr.html
-library(dplyr)
-library(assertr)
-
-# set our.data to reference the mtcars dataset
-our.data <- mtcars
-
-# simulate an issue in the data for contract specification
-our.data$mpg[5] <- our.data$mpg[5] * -1
-
-# use verify to validate that column mpg >= 0
-our.data %>%
-  verify(mpg >= 0)
-
-# use assert to validate that column mpg is within the bounds of 0 to infinity
-our.data %>%
-  assert(within_bounds(0, Inf), mpg)
-
-
-
Example code leveraging R package Assertr to perform various data component contract validation.
-
-
Assertr is an R project which provides similar data component assertions in the form of verify, assert, and insist methods (see here for more documentation).
-Using Assertr enables a similar but more lightweight functionality to that of Great Expectations.
-See the above for an example of how to use it in your projects.
-
-
Data Schema Testing
-
-
-
-
Diagram showing data contracts as more granular specifications via “schema” testing being checked through contracts and raising an error if they aren’t met or continuing operations if they are met.
-
-
Sometimes we need greater specificity than what a data component can offer.
-We can use data schema testing contracts in these cases.
-The word “schema” here is used from the context of database schema, but oftentimes these specifications are suitable well beyond solely databases (including database-like formats like dataframes).
-While reuse and modularity are more limited with these cases, they can be helpful for efforts where precision is valued or necessary to accomplish your goals.
-It’s worth mentioning that data schema and component testing tools often have many overlaps (meaning you can interchangeably use them to accomplish both tasks).
-
-
Data Schema Testing - Pandera
-
-
"""
-Example of using the Pandera package
-referenced with modifications from:
-https://pandera.readthedocs.io/en/stable/try_pandera.html
-"""
-importpandasaspd
-importpanderaaspa
-frompandera.typingimportDataFrame,Series
-
-
-# define a schema
-classSchema(pa.DataFrameModel):
- item:Series[str]=pa.Field(isin=["apple","orange"],coerce=True)
- price:Series[float]=pa.Field(gt=0,coerce=True)
-
-
-# simulate invalid dataframe
-invalid_data=pd.DataFrame.from_records(
- [{"item":"applee","price":0.5},
- {"item":"orange","price":-1000}]
-)
-
-
-# set a decorator on a function which will
-# check the schema as a precondition
-@pa.check_types(lazy=True)
-defprecondition_transform_data(data:DataFrame[Schema]):
- print("here")
- returndata
-
-
-# precondition schema testing
-try:
- precondition_transform_data(invalid_data)
-exceptpa.errors.SchemaErrorsasschema_excs:
- print(schema_excs)
-
-# inline or implied postcondition schema testing
-try:
- Schema.validate(invalid_data)
-exceptpa.errors.SchemaErrorasschema_exc:
- print(schema_exc)
-
-
-
Example code leveraging Python package Pandera to perform various data schema contract validation.
-
-
-DataFrame-like libraries like Pandas can be verified using schema specification contracts through Pandera (see here for full DataFrame library support).
-Pandera helps define specific columns, column types, and also has some component-like features.
-It leverages a Pythonic class specification, similar to data classes and pydantic models, making it potentially easier to use if you already understand Python and DataFrame-like libraries.
-See the above example for a look into how Pandera may be used.
-
-
Data Schema Testing - JSON Schema
-
-
# Example of using the jsonvalidate R package.
-# Referenced with modifications from:
-# https://docs.ropensci.org/jsonvalidate/articles/jsonvalidate.html
-
-schema <- '{
- "$schema": "https://json-schema.org/draft/2020-12/schema",
- "title": "Hello World JSON Schema",
- "description": "An example",
- "type": "object",
- "properties": {
- "hello": {
- "description": "Provide a description of the property here",
- "type": "string"
- }
- },
- "required": [
- "hello"
- ]
-}'
-
-# create a schema contract for data
-validate <- jsonvalidate::json_validator(schema, engine = "ajv")
-
-# validate JSON using schema specification contract and invalid data
-validate("{}")
-
-# validate JSON using schema specification contract and valid data
-validate('{"hello": "world"}')
-
-
-
JSON Schema provides a vocabulary way to validate schema contracts for JSON documents.
-There are several implementations of the vocabulary, including Python package jsonschema, and R package jsonvalidate.
-Using these libraries allows you to define pre- or postcondition data schema contracts for your software work.
-See above for an R based example of using this vocabulary to perform data schema testing.
-
-
Shift-left Data Testing
-
-
-
-
Earlier portions of this article have covered primarily data validation of command side-effects and postconditions.
-This is commonplace in development where data sources usually are provided without the ability to validate their precondition or definition.
-Shift-left testing is a movement which focuses on validating earlier in the lifecycle if and when possible to avoid downstream issues which might occur.
-
-
Shift-left Data Testing - Data Version Control (DVC)
-
-
-
-
Data sources undergoing frequent changes become difficult to use because we oftentimes don’t know when the data is from or what version it might be.
-This information is sometimes added in the form of filename additions or an update datetime column in a table.
-Data Version Control (DVC) is one tool which is specially purposed to address this challenge through source control techniques.
-Data managed by DVC allows software to be built in such a way that version preconditions are validated before reaching data transformations (commands) or postconditions.
-
-
Shift-left Data Testing - Flyway
-
-
-
-
Database sources can leverage an idea nicknamed “database as code” (which builds on a similar idea about infrastructure as code) to help declare the schema and other elements of a database in the same way one would code.
-These ideas apply to both databases and also more broadly through DVC mentioned above (among other tools) via the concept “data as code”.
-Implementing this idea has several advantages from source versioning, visibility, and replicability.
-One tool which implements these ideas is Flyway which can manage and implement SQL-based files as part of software data precondition validation.
-A lightweight alternative to using Flyway is sometimes to include a SQL file which creates related database objects and becomes data documentation.
]]>dave-buntenTip of the Week: Python Packaging as Publishing2023-09-05T00:00:00+00:002024-01-25T20:55:52+00:00/set-website/preview/pr-29/2023/09/05/Python-Packaging-as-PublishingTip of the Week: Python Packaging as Publishing
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Python packaging is the craft of preparing for and reaching distribution of your Python work to wider audiences. Following conventions for packaging helps your software work become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post will cover preparation aspects of packaging, readying software work for wider distribution.
The practice of Python packaging is similar to that of publishing a book. Consider how a bag of text is different from a book. How and why are these things different?
-
-
-
A book has commonly understood sequencing of content (i.e. copyright page, then title page, then body content pages…).
-
A book often cites references and acknowledges other work explicitly.
-
A book undergoes a manufacturing process which allows the text to be received in many places the same way.
-
-
-
-
-
-These can be thought of as metaphors when it comes to packaging in Python. Books have a smell which sometimes comes from how they were stored, treated, or maintained. While there are pleasant book smells, they might also smell soggy from being left in the rain or stored without maintenance for too long. Just like books, software can sometimes have negative code smells indicating a lack of care or a less sustainable condition. Following good packaging practices helps to avoid unwanted code smells while increasing development velocity, maintainability of software through understandability, trustworthiness of the content, and connection to other projects.
-
-
-
-
-
-
-
Note: these techniques can also work just as well for inner source collaboration (private or proprietary development within organizations)! Don’t hesitate to use these on projects which may not be public facing in order to make development and maintenance easier (if only for you).
A Python package is a collection of modules (.py files) that usually includes an “initialization file” __init__.py. This post will cover the craft of packaging which can include one or many packages.
Python Packaging today generally assumes a specific directory design.
-Following this convention generally improves the understanding of your code. A sketch of a common layout appears below, and we’ll cover each of these elements in turn.
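-
-As a rough sketch (project and package names are hypothetical placeholders), such a layout often looks like the following:
-
-```
-project_directory/
-├── README.md
-├── LICENSE.txt
-├── pyproject.toml
-├── docs/
-│   └── (documentation content and docsite build code)
-├── src/
-│   └── package_name/
-│       ├── __init__.py
-│       └── module_name.py
-└── tests/
-    └── test_module_name.py
-```
-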
The README.md file is a markdown file with documentation including project goals and other short notes about installation, development, or usage. The README.md file is akin to a book jacket blurb which quickly tells the audience what the book will be about.
-
The LICENSE.txt file is a text file which indicates licensing details for the project. It often includes information about how it may be used and protects the authors in disputes. The LICENSE.txt file can be thought of like a book’s copyright page. See https://choosealicense.com/ for more details on selecting an open source license.
-
The pyproject.toml file is a Python-specific TOML file which helps organize how the project is used and built for wider distribution. The pyproject.toml file is similar to a book’s table of contents, index, and printing or production specification.
The docs directory is used for in-depth documentation and related documentation build code (for example, when building documentation websites, aka “docsites”). The docs directory includes information similar to a book’s “study guide”, providing content surrounding how to best make use of and understand the content found within.
-
The src directory includes primary source code for use in the project. Python projects generally use a nested package directory with modules and sub-packages. The src directory is like a book’s body or general content (perhaps thinking of modules as chapters or sections of related ideas).
-
The tests directory includes testing code for validating functionality of code found in the src directory. The above follows pytest conventions. The tests directory is for code which acts like a book’s early reviewers or editors, making sure that if you change things in src the impacts remain as expected.
-
-
-
Common directory structure examples
-
-
The Python directory structure described above can be witnessed in the wild from the following resources. These can serve as a great resource for starting or adjusting your own work.
Building an understandable body of content helps tremendously with audience trust. What else can we do to enhance project trust? The following elements can help improve an audience’s trust in packaged Python work.
-
-
Source control authenticity
-
-
-
-
Be authentic! Fill out your profile to help your audience know the author and why you do what you do. See here for GitHub’s documentation on filling out your profile. Doing this may seem irrelevant but can go a long way to making technical work more relatable.
-
-
-
Add a profile picture of yourself or something fun.
-
Set your profile description to information which is both professionally accurate and unique to you.
-
Show or link to work which you feel may be relevant or exciting to those in your audience.
-
-
-
Staying up to date with supported Python releases
-
-
-
-
Use Python versions which are supported (this changes over time).
-Python versions which are end-of-life may be difficult to support and are a sign of code decay for projects. Specify the version of Python which is compatible with your project by using environment specifications such as pyproject.toml files and related packaging tools (more on this below).
Staying up to date with supported releases oftentimes can result in performance or other similar benefits (later versions usually include improvements!).
-
-
-
Security linting and visible checks with GitHub Actions
-
-
-
-
Use security vulnerability linters to help prevent undesirable or risky processing for your audience. Doing this is both practical for avoiding issues and conveys that you care about those using your package!
gitleaks: checks for sensitive passwords, keys, or tokens
-
-
-
-
-
Combining GitHub Actions with security linters and tests from your software validation suite can add an observable ✅ for your project.
-This provides the audience with a sense that you’re transparently testing and sharing the results of those tests.
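As one sketch of wiring such a check into your workflow, gitleaks can be run through pre-commit (the rev shown is illustrative; pin a release you have verified):

```yaml
# example .pre-commit-config.yaml entry for gitleaks
# (the rev below is illustrative; pin a verified release)
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks
```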
Connection: personal and inter-package relationships
-
-
-
-
Understandability and trust set the stage for your project’s connection to other people and projects. What can we do to facilitate connection with our project? Use the following techniques to help enhance your project’s connection to others and their work.
-
-
Acknowledging authors and referenced work with CITATION.cff
-
-
-
-
Add a CITATION.cff file to your project root in order to describe project relationships and acknowledgements in a standardized way. The CFF format is also GitHub compatible, making it easier to cite your project.
-
-
-
This is similar to a book’s credits, acknowledgements, dedication, and author information sections.
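A minimal CITATION.cff might look like the following (author names, version, and date are placeholders):

```yaml
# CITATION.cff (placeholder values)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "package-name"
type: software
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: 0.1.0
date-released: 2023-01-01
```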
Provide a CONTRIBUTING.md file to your project root so as to make clear support details, development guidance, code of conduct, and overall documentation surrounding how the project is governed.
Environment management reproducibility as connected project reality
-
-
-
-
Code without an environment specification is difficult to run in a consistent way. This can lead to “works on my machine” scenarios where different things happen for different people, reducing the chance that people can connect with a shared reality for how your code should be used.
-
-
-
“But why do we have to switch the way we do things?”
-We’ve always been switching approaches (software approaches evolve over time)! A brief history of Python environment and packaging tooling:
-
-
-
distutils, easy_install + setup.py (primarily used during 1990’s - early 2000’s)
-
pip, setup.py + requirements.txt (primarily used during late 2000’s - early 2010’s)
-
poetry + pyproject.toml (began use around late 2010’s - ongoing)
-
-
-
-
Using Python poetry for environment and packaging management
-
-
-
-
Poetry is one Pythonic environment and packaging manager which can help increase reproducibility using pyproject.toml files. It’s one of many such tools; alternatives include hatch and pipenv.
After installation, Poetry gives us the ability to initialize a directory structure similar to what we presented earlier by using the poetry new ... command. If you’d like a more interactive version of the same, use the poetry init command to fill out various sections of your project with detailed information.
Using the poetry new ... command also initializes the content of our pyproject.toml file with opinionated details (following the recommendation from earlier in the article regarding declared Python version specification).
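For instance, the generated pyproject.toml might resemble the following (the package name, author, and version constraints are illustrative), including the declared Python version mentioned earlier:

```toml
# pyproject.toml (illustrative content similar to what `poetry new` produces)
[tool.poetry]
name = "package-name"
version = "0.1.0"
description = ""
authors = ["Jane Doe <jane.doe@example.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```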
-
-
poetry dependency management
-
-
user@machine % poetry add pandas
-
-Creating virtualenv package-name-1STl06GY-py3.9 in /pypoetry/virtualenvs
-Using version ^2.1.0 for pandas
-
-...
-
-Writing lock file
-
-
-
We can add dependencies directly using the poetry add ... command. This command also provides the possibility of using a group flag (for example poetry add pytest --group testing) to help organize and distinguish multiple sets of dependencies.
-
-
-
A local virtual environment is managed for us automatically.
-
A poetry.lock file is written when the dependencies are installed to help ensure the version you installed today will be what’s used on other machines.
-
The poetry.lock file helps ensure reproducibility when dealing with dependency version ranges (where otherwise we may end up using different versions which match the dependency ranges but observe different results).
-
-
-
Running Python from the context of poetry environments
-
-
% poetry run python -c "import pandas; print(pandas.__version__)"
-
-2.1.0
-
This allows us to quickly run code through the context of the project’s environment.
-
Poetry can automatically switch between multiple environments based on the local directory structure.
-
We can also use the environment as a “shell” (similar to virtualenv’s activate) with the poetry shell command, which enables us to leverage a dynamic session in the context of the poetry environment.
Even if we don’t reach wider distribution on PyPI or elsewhere, source code managed by pyproject.toml and poetry can be used for “manual” distribution (with reproducible results) from GitHub repositories. When we’re ready to distribute pre-built packages on other networks we can also use the following:
-
-
% poetry build
-
-Building package-name (0.1.0)
- - Building sdist
- - Built package_name-0.1.0.tar.gz
- - Building wheel
- - Built package_name-0.1.0-py3-none-any.whl
-
-
-
Poetry readies source-code and pre-compiled versions of our code for distribution platforms like PyPI by using the poetry build command. We’ll cover more on these files and distribution steps in a later post!
Tip of the Week: Using Python and Anaconda with the Alpine HPC Cluster (2023-07-07)
-
-
-
-
-
-
-
Diagram showing common benefits of Alpine and HPC clusters.
-
-
Alpine is a High Performance Compute (HPC) cluster.
-HPC environments provide shared computer hardware resources like memory, CPU, GPU or others to run performance-intensive work.
-Reasons for using Alpine might include:
-
-
-
Compute resources: Leveraging otherwise cost-prohibitive amounts of memory, CPU, GPU, etc. for processing data.
-
Long-running jobs: Completing long-running processes which may take hours or days to complete.
-
Collaborations: Sharing a single implementation environment for reproducibility within a group (avoiding “works on my machine” inconsistency issues).
-
-
-
How does Alpine work?
-
-
-
-
Diagram showing high-level user workflow and Alpine components.
-
-
Alpine’s compute resources are used through compute nodes in a system called Slurm.
-Slurm is a system that allows a large number of users to run jobs on a cluster of computers; the system figures out how to use all the computers in the cluster to execute all users’ jobs fairly (i.e., giving each user approximately equal time and resources on the cluster). A job is a request to run something, e.g. a bash script or a program, along with specifications about how much RAM and CPU it needs, how long it can run, and how it should be executed.
-
-
Slurm’s role in general is to take in a job (submitted via the sbatch command) and put it into a queue (also called a “partition” in Slurm). For each job in the queue, Slurm constantly tries to find a computer in the cluster with enough resources to run that job and, when an available computer is found, runs the program the job specifies on that computer. As the program runs, Slurm records its output to files and finally reports the program’s exit status (either completed or failed) back to the job manager.
-
-
Importantly, jobs can either be marked as interactive or batch. When you submit an interactive job, sbatch will pause while waiting for the job to start and then connect you to the program, so you can see its output and enter commands in real time. On the other hand, submitting a batch job will return immediately; you can see the progress of your job using squeue, and you can typically see the output of the job in the folder from which you ran sbatch unless you specify otherwise.
-Data for or from Slurm work may be stored temporarily on local storage or on user-specific external (remote) storage.
-
-
-
-
-
-
-
Wait, what are “nodes”?
-
-
A simplified way to understand the architecture of Slurm on Alpine is through login and compute “nodes” (computers).
-Login nodes act as a place to prepare and submit jobs which will be completed on compute nodes. Login nodes are never used to execute Slurm jobs, whereas compute nodes are exclusively accessed via a job.
-Login nodes have limited resource access and are not recommended for running procedures.
-
-
-
-
-
One can interact with Slurm on Alpine by use of Slurm interfaces and directives.
-A quick way of accessing Alpine resources is through the use of the acompile command, which starts an interactive job on a compute node with some typical default parameters for the job. Since acompile requests very modest resources (1 hour and 1 CPU core at the time of writing), you’ll typically quickly be connected to a compute node. For more intensive or long-lived interactive jobs, consider using sinteractive, which allows for more customization: Interactive Jobs.
-One can also access Slurm directly through various commands on Alpine.
Using Alpine effectively involves knowing how to leverage Slurm.
-A simplified way to understand how Slurm works is through the following sequence.
-Please note that some steps and additional complexity are omitted for the purposes of providing a basis of understanding.
-
-
-
Create a job script: build a script which will configure and run procedures related to the work you seek to accomplish on the HPC cluster (see the brief example script after this list).
-
Submit job to Slurm: ask Slurm to run a set of commands or procedures.
-
Job queue: Slurm will queue the submitted job alongside others (recall that the HPC cluster is a shared resource), providing information about progress as time goes on.
-
Job processing: Slurm will run the procedures in the job script as scheduled.
-
Job completion or cancellation: submitted jobs eventually may reach completion or cancellation states with saved information inside Slurm regarding what happened.
-
-
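To illustrate the job script step referenced above, a simple sbatch script might look like the following (the partition, resource values, and commands are placeholders; consult Alpine's documentation for settings appropriate to your work):

```bash
#!/bin/bash
# example_job.sh - illustrative Slurm batch script (values are placeholders)
#SBATCH --job-name=example-job
#SBATCH --partition=amilan          # placeholder partition name
#SBATCH --time=00:10:00             # maximum run time (hh:mm:ss)
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --output=example-job.%j.out # %j expands to the Slurm job id

# the work performed by this job
echo "Hello from $(hostname)"
```

The script could then be submitted with, for example, sbatch example_job.sh.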
-
How do I store data on Alpine?
-
-
-
-
Data used or produced by your processed jobs on Alpine may use a number of different data storage locations.
-Be sure to follow the Acceptable data storage and use policies of Alpine, avoiding the use of certain sensitive information and other items.
-These may be distinguished in two ways:
-
-
-
-
Alpine local storage (sometimes temporary): Alpine provides a number of temporary data storage locations for accomplishing your work.
-⚠️ Note: some of these locations may be periodically purged and are not a suitable location for long-term data hosting (see here for more information)!
-Storage locations available (see this link for full descriptions):
-
-
-
Home filesystem: 2 GB of backed up space under /home/$USER (where $USER is your RMACC or Alpine username).
-
Projects filesystem: 250 GB of backed up space under /projects/$USER (where $USER is your RMACC or Alpine username).
-
Scratch filesystem: 10 TB (10,240 GB) of space which is not backed up under /scratch/alpine/$USER (where $USER is your RMACC or Alpine username).
-
-
-
-
External / remote storage: Users are encouraged to explore external data storage options for long-term hosting.
-Examples may include the following:
-
-
-
PetaLibrary: subsidized external storage host from University of Colorado Boulder’s Research Computing (requires specific arrangements outside of Alpine).
Others: additional options include third-party “storage as a service” offerings like Google Drive or Dropbox and/or external servers maintained by other groups.
-
-
-
-
-
How do I send or receive data on Alpine?
-
-
-
-
Diagram showing external data storage being used to send or receive data on Alpine local storage.
-
-
Data may be sent to or gathered from Alpine using a number of different methods.
-These may vary contingent on the external data storage being referenced, the code involved, or your group’s available resources.
-Please reference the following documentation from the University of Colorado Boulder’s Research Computing regarding data transfers: The Compute Environment - Data Transfer.
-Please note: due to the authentication configuration of Alpine many local or SSH-key based methods are not available for CU Anschutz users.
-As a result, Globus represents one of the best options available (see 3. 📂 Transfer data results below). While the Globus tutorial in this document describes how you can download data from Alpine to your computer, note that you can also use Globus to transfer data to Alpine from your computer.
-
-
Implementation
-
-
-
-
Diagram showing how an example project repository may be used within Alpine through primary steps and processing workflow.
-
-
Use the following steps to understand how Alpine may be used with an example project repository to run example Python code.
-
-
0. 🔑 Gain Alpine access
-
-
First you will need to gain access to Alpine.
-This access is provided to members of the University of Colorado Anschutz through RMACC and is separate from other credentials which may be provided by default in your role.
-Please see the following guide from the University of Colorado Boulder’s Research Computing covering requesting access and generally how this works for members of the University of Colorado Anschutz.
[username@xsede.org@login-ciX ~]$ cd /projects/$USER
-[username@xsede.org@login-ciX username@xsede.org]$ git clone https://github.com/CU-DBMI/example-hpc-alpine-python
-Cloning into 'example-hpc-alpine-python'...
-... git output ...
-[username@xsede.org@login-ciX username@xsede.org]$ ls -l example-hpc-alpine-python
-... ls output ...
-
-
-
An example of what this preparation section might look like in your Alpine terminal session.
-
-
Next we will prepare our code within Alpine.
-We do this to balance the fact that we may develop and source control code outside of Alpine.
-In the case of this example work, we assume git as an interface for GitHub as the source control host.
-
-
Below you’ll find the general steps associated with this process.
Change directory into the Projects filesystem (generally we’ll assume processed data produced by this code are large enough to warrant the need for additional space): cd /projects/$USER
-
Use git (built into Alpine by default) commands to clone this repo: git clone https://github.com/CU-DBMI/example-hpc-alpine-python
-
Verify the contents were received as desired (this should show the contents of an example project repository): ls -l example-hpc-alpine-python
-
-
-
-
-
-
-
-
-
-
-
-
What if I need to authenticate with GitHub?
-
-
There are times where you may need to authenticate with GitHub in order to accomplish your work.
-From a GitHub perspective, you will want to use either GitHub Personal Access Tokens (PAT) (recommended by GitHub) or SSH keys associated with the git client on Alpine.
-Note: if you are prompted for a username and password from git when accessing a GitHub resource, the password is now associated with other keys like PATs instead of your user’s password (reference).
-See the following guide from GitHub for more information on how authentication through git to GitHub works:
[username@xsede.org@login-ciX ~]$ sbatch --export=CSV_FILEPATH="/projects/$USER/example_data.csv" example-hpc-alpine-python/run_script.sh
-[username@xsede.org@login-ciX username@xsede.org]$ tail -f example-hpc-alpine-python.out
-... tail output (ctrl/cmd + c to cancel) ...
-[username@xsede.org@login-ciX username@xsede.org]$ head -n 2 example_data.csv
-... data output ...
-
-
-
An example of what this implementation section might look like in your Alpine terminal session.
-
-
After our code is available on Alpine we’re ready to run it using Slurm and related resources.
-We use Anaconda to build a Python environment with specified packages for reproducibility.
-The main goal of the Python code related to this work is to create a CSV file with random data at a specified location.
-We’ll use Slurm’s sbatch command, which submits batch scripts to Slurm using various options.
-
-
-
Use the sbatch command with exported variable CSV_FILEPATH. sbatch --export=CSV_FILEPATH="/projects/$USER/example_data.csv" example-hpc-alpine-python/run_script.sh
-
After a short moment, use the tail command to observe the log file created by Slurm for this sbatch submission. This file can help you understand where things are at and if anything went wrong. tail -f example-hpc-alpine-python.out
-
Once you see that the work has completed from the log file, take a look at the top 2 lines of the data file using the head command to verify the data arrived as expected (column names with random values): head -n 2 example_data.csv
-
-
-
3. 📂 Transfer data results
-
-
-
-
Diagram showing how example_data.csv may be transferred from Alpine to a local machine using Globus solutions.
-
-
Now that the example data output from the Slurm work is available we need to transfer that data to a local system for further use.
-In this example we’ll use Globus as a data transfer method from Alpine to our local machine.
-Please note: always be sure to check data privacy and policy which change the methods or storage locations you may use for your data!
During installation, you will be prompted to log in to Globus. Use your ACCESS credentials to log in.
-
During installation login, note the label you provide to Globus. This will be used later, referenced as “Globus Connect Personal label”.
-
Ensure you add, and (importantly) provide write access to, a local directory via Globus Connect Personal - Preferences - Access where you’d like the data to be received from Alpine to your local machine.
Configure File Manager left side (source selection)
-
-
Within the Globus web interface on the File Manager tab, use the Collection input box to search or select “CU Boulder Research Computing ACCESS”.
-
Within the Globus web interface on the File Manager tab, use the Path input box to enter: /projects/your_username_here/ (replacing “your_username_here” with your username from Alpine, including the “@” symbol if it applies).
-
-
-
Configure File Manager right side (destination selection)
-
-
Within the Globus web interface on the File Manager tab, use the Collection input box to search or select the Globus Connect Personal label you provided in earlier steps.
-
Within the Globus web interface on the File Manager tab, use the Path input box to enter the local path which you made accessible in earlier steps.
-
-
-
Begin Globus transfer
-
-
Within the Globus web interface on the File Manager tab on the left side (source selection), check the box next to the file example_data.csv.
-
Within the Globus web interface on the File Manager tab on the left side (source selection), click the “Start ▶️” button to begin the transfer from Alpine to your local directory.
-
After clicking the “Start ▶️” button, you may see a notification in the top right with the message “Transfer request submitted successfully”. You can click the link to view the details associated with the transfer.
-
After a short period, the file will be transferred and you should be able to verify the contents on your local machine.
Tip of the Week: Automate Software Workflows with GitHub Actions (2023-03-15)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
There are many routine tasks which can be automated to help save time and increase reproducibility in software development. GitHub Actions provides one way to accomplish these tasks using code-based workflows and related workflow implementations. This type of automation is commonly used to perform tests, builds (preparing for the delivery of the code), or delivery itself (sending the code or related artifacts where they will be used).
-flowchart LR
- start((start)) --> action
- action["action(s)"] --> en((end))
- style start fill:#6EE7B7
- style en fill:#FCA5A5
-
-
-
-
An example workflow.
-
-
Workflows consist of sequenced activities used by various systems. Software development workflows help accomplish work the same way each time by using what are commonly called “workflow engines”. Generally, workflow engines are provided code which indicate beginnings (what triggers a workflow to begin), actions (work being performed in sequence), and an ending (where the workflow stops). There are many workflow engines, including some which help accomplish work alongside version control.
-
-
GitHub Actions
-
-
-flowchart LR
- subgraph workflow [GitHub Actions Workflow Run]
- direction LR
- action["action(s)"] --> en((end))
- start((event\ntrigger))
- end
- start --> action
- style start fill:#6EE7B7
- style en fill:#FCA5A5
-
-
-
A diagram showing GitHub Actions as a workflow.
-
-
GitHub Actions is a feature of GitHub which allows you to run workflows in relation to your code as a continuous integration (including automated testing, builds, and deployments) and general automation tool. For example, one can use GitHub Actions to make sure code related to a GitHub Pull Request passes certain tests before it is allowed to be merged. GitHub Actions may be specified using YAML files within your repository’s .github/workflows directory by using syntax specific to Github’s workflow specification. Each YAML file under the .github/workflows directory can specify workflows to accomplish tasks related to your software work. GitHub Actions workflows may be customized to your own needs, or use an existing marketplace of already-created Actions.
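As a small, hypothetical illustration (the file name, trigger events, and steps are examples only), a workflow file under .github/workflows might look like the following:

```yaml
# .github/workflows/example-tests.yml (hypothetical example)
name: run tests
on: [push, pull_request] # events which trigger this workflow
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: run pytest
        run: |
          pip install pytest
          pytest
```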
-
-
-
-
GitHub provides an “Actions” tab for each repository which helps visualize and control GitHub Actions workflow runs. This tab shows a history of all workflow runs in the repository. For each run, it shows whether it ran successfully or not, the associated logs, and controls to cancel or re-run it.
-
-
-
GitHub Actions Examples
-GitHub Actions is sometimes better understood with examples. See the following references for a few basic examples of using GitHub Actions in a simulated project repository.
-
-
-
1. example-action.yml: demonstrates how to run a snippet of Python code in a basic GitHub Actions workflow.
-
2. run-python-file.yml: demonstrates how to reliably reproduce the environment by installing dependencies using Poetry, and then run a Python file in that environment.
-flowchart LR
- subgraph container ["local simulation container(s)"]
- direction LR
- subgraph workflow [GitHub Actions Workflow Run]
- direction LR
- start((event\ntrigger))
- action --> en((end))
- end
- end
- start --> action
- act[Run Act] -.-> |Simulate\ntrigger| start
- style start fill:#6EE7B7
- style en fill:#FCA5A5
-
-
-
A diagram showing how GitHub Actions workflows may be triggered from Act
-
-
One challenge with GitHub Actions is a lack of standardized local testing tools. For example, how will you know that a new GitHub Actions workflow will function as expected (or at all) without pushing to the GitHub repository? One third-party tool which can help with this is Act. Act uses Docker images (which require Docker Desktop) to simulate running a GitHub Actions workflow within your local environment. Using Act can sometimes avoid guessing what will occur when a GitHub Actions workflow is added to your repository. See Act’s installation documentation for more information on getting started with this tool.
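For example, after installing Act and Docker Desktop, local simulations might look roughly like the following (the job name is hypothetical and output will vary):

```bash
# list the jobs Act discovers under .github/workflows
act --list

# simulate the default (push) event locally
act

# simulate a pull_request event for a specific job (job name is hypothetical)
act pull_request --job tests
```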
-
-
Nested Workflows with GitHub Actions
-
-
-flowchart LR
-
- subgraph action ["Nested Workflow (Dagger, etc)"]
- direction LR
- actions
- start2((start)) --> actions
- actions --> en2((end))
- en2((end))
- end
- subgraph workflow2 [Local Environment Run]
- direction LR
- run2[run workflow]
- en3((end))
- start3((event\ntrigger))
- end
- subgraph workflow [GitHub Actions Workflow Run]
- direction LR
- start((event\ntrigger))
- run[run workflow]
- en((end))
- end
-
- start --> run
- start3 --> run2
- action -.-> run
- run --> en
- run2 --> en3
- action -.-> run2
- style start fill:#6EE7B7
- style start2 fill:#D1FAE5
- style start3 fill:#6EE7B7
- style en fill:#FCA5A5
- style en2 fill:#FFE4E6
- style en3 fill:#FCA5A5
-
-
-
A diagram showing how GitHub Actions may leverage nested workflows with tools like Dagger.
-
-
There are times when GitHub Actions may be too constricting or Act may not accurately simulate workflows. We also might seek to “write once, run anywhere” (WORA) to enable flexible development on many environments. One workaround to this challenge is to use nested workflows which are compatible with local environments and GitHub Actions environments. Dagger is one tool which enables programmatically specifying and using workflows this way. Using Dagger allows you to trigger workflows on your local machine or GitHub Actions with the same underlying engine, meaning there are fewer inconsistencies or guesswork for developers (see here for an explanation of how Dagger works).
-
-
There are also other alternatives to Dagger you may want to consider based on your use case, preference, or interest. Earthly is similar to Dagger and uses “earthfiles” as a specification. Both Dagger and Earthly (in addition to GitHub Actions) use container-based approaches, which in-and-of themselves present additional alternatives outside the scope of this article.
-
-
-
GitHub Actions with Nested Workflow Example
-Reference this example for a brief demonstration of how GitHub Actions and Dagger may be used together.
-
-
-
4. run-matrixed-pytest-dagger.yml: demonstrates how to run matrixed Python versions for confirming passing pytest tests using GitHub Actions and Dagger together. A GitHub Actions matrix strategy is used to span concurrent work while retaining the reproducibility from Dagger task specification.
-
-
-
-
Closing Remarks
-
-
Using GitHub Actions through the above methods can help automate your technical work and increase the quality of your code with sometimes very little additional effort. Saving time through this form of automation can provide additional flexibility to accomplish more complex work which requires your attention (perhaps using timeboxing techniques). Even small amounts of time saved can turn into large opportunities for other work. On this note, be sure to explore how GitHub Actions can improve things for your software endeavors.
Tip of the Week: Branch, Review, and Learn (2023-02-13)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Git provides a feature called branching which facilitates parallel and segmented programming work through commits with version control. Using branching enables both work concurrency (multiple people working on the same repository at the same time) as well as a chance to isolate and review specific programming tasks. This article covers some conceptual best practices with branching, reviewing, and merging code using Github.
-
-
-
-
Please note: the content below represents one opinion in a larger space of Git workflow concepts (it’s not perfect!). Developer cultures may vary on these topics; be sure to acknowledge people and culture over exclusive or absolute dedication to what is found below.
-flowchart LR
- subgraph Course
- direction LR
- open["open\nassignment"]
- turn_in["review\nassignment"]
- end
- subgraph Student [" Student"]
- direction LR
- work["completed\nassignment"]
- end
- open -.-> turn_in
- open --> |works towards| work
- work --> |seeks review| turn_in
-
-
-
-
An example course and student assignment workflow.
-
-
Git branching practices may be understood in context with similar workflows from real life. Consider a student taking a course, where an assignment is given to them to complete. In addition to the steps shown in the diagram above, it’s important to think about why this pattern is beneficial:
-
-
-
Completing an assignment allows us as social, inter-dependent beings to present new findings which enable learning and amalgamation of additional ideas from others.
-
The timebound nature of assignments enables us to practice some form of timeboxing so as to minimize tasks which may take too much time.
-
Segmenting applied learning in distinct, goal-orientated chunks helps make larger topics easier to understand.
An example git diagram showing assignment branch based off main.
-
-
Following the course assignment workflow, the diagram above shows an in-progress assignment branch based off of the main branch. When the assignment branch is created, we bring into it everything we know from main (the course) so far in the form of commits, or groups of changes to various files. Branching allows us to make consistent and well described changes based on what’s already happened without impacting others work in the meantime.
-
-
-
Branching best practices:
-
-
-
Keep the name and work with branches dedicated to a specific and focused purpose. For example: a branch named fix-links-in-docs might entail work related to fixing HTTP links within documentation (see the brief command-line sketch after this list).
-
Consider the use of Github Forks (along with branches within the fork) to help further isolate and enrich work potential. Forks also allow remixing existing work into new possibilities.
-
festina lente or “make haste, slowly”: Commits on any branch represent small chunks of a cohesive idea which will eventually be brought to main. It is often beneficial to be consistent with small, gradual commits to avoid a rushed or incomplete submission. The same applies more generally for software; taking time upfront to do things well can mean time saved later.
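As a brief command-line sketch of this workflow (the branch name, file, and commit message are examples only):

```bash
# create and switch to a focused branch based on main
git checkout main
git pull
git checkout -b fix-links-in-docs

# make small, well-described commits as the work progresses
git add docs/index.md
git commit -m "Fix broken HTTP links in documentation"

# push the branch so a pull request may be opened for review
git push -u origin fix-links-in-docs
```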
An example git diagram showing assignment branch being merged with main after a review.
-
-
The diagram above depicts a merge from the assignment branch to pull the changes into the main branch, simulating an assignment being returned for review within a course. While merges may be forced without review, it’s a best practice to create a Pull Request (PR) Review (also known as a Merge Request (MR) on some systems) and then ask other members of your team to review it. Doing this provides a chance to make revisions before code changes are “finalized” within the main branch.
-
-
-
Github provides special tools for reviews which can assist both the author and reviewer:
-
-
-
Keep code changes intended for review small, enabling reviewers to reason through the work to more quickly provide feedback and practicing incremental continuous improvement (it may be difficult to address everything at once!). This also may denote the git history for a repository in a clearer way.
-
Github comments: Overall review comments (encompassing all work from the branch) and Inline comments (inquiring about individual lines of code) may be provided. Inline comments may also include code suggestions, which allow for code-based revision suggestions that may be committed directly to the branch using markdown suggestion codeblocks (see the example after this list).
-
Github issues: Creating issues from comments allows the creation of new repository issues to address topics outside of the current PR.
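For example, an inline review comment containing a suggestion block like the one below lets the author commit the proposed change directly from the PR (the line of code shown is hypothetical):

````markdown
```suggestion
total = sum(values)  # hypothetical corrected line offered by the reviewer
```
````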
An example git diagram showing the main branch after the assignment branch has been merged (and removed).
-
-
Changes may be made within the assignment branch until the work is in a state where the authors and reviewers are satisfied. At this point, the branch changes may be merged into main. Approvals are sometimes provided informally (for ex., with a comment: “LGTM (looks good to me)!”) or explicitly (for ex., approvals within Github) to indicate or enable branch merge readiness. After the merge, changes may continue to be made in a similar way (perhaps accounting for concurrently branched work elsewhere). Generally, a merged branch may be removed afterwards to help maintain an organized working environment (see Github PR branch removal).
-
-
-
Github provides special tools for merging:
-
-
-
Decide which merge strategy is appropriate (there are many!): There are many merge strategies within Github (merge commits, squash merges, and rebase merging). Take time to understand them and choose which one works best.
-
Consider using branch protection to automate merge requirements: The main or other branches may be “protected” against merges using branch protection rules. These rules can require reviewer approvals or automatic status checks to pass before changes may be merged.
-
Use merge queuing to manage multiple PR’s: When there are many unmerged PR’s, it can sometimes be difficult to document and ensure each are merged in a desired sequence. Consider using merge queues to help with this process.
-
-
-
-
Additional Resources
-
-
The links below may provide additional guidance on using these git features, including in-depth coverage of various features and related configuration.
Tip of the Week: Software Linting with R (2023-01-30)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
This article covers using the software technique of linting on R code in order to improve code quality, development velocity, and collaboration.
-
-
-
-
TLDR (too long, didn’t read);
-Use software linting (static analysis) practices on your R code with existing packages lintr and styler (among others). These linters may be applied using pre-commit in your local development environment or as continuous tests using, for example, Github Actions.
-
-
Treating R as Software
-
-
-
“Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented.”
The R programming language is sometimes treated as only a statistics system instead of software. This treatment can sometimes lead to common issues in development which are experienced in other languages. Addressing R as software enables developers to enhance their work by taking benefit from existing concepts applied to many other languages.
-
-
Linting with R
-
-
-flowchart LR
- write[Write R code] --> |check| check[Check code with linters]
- check --> |revise| write
-
-
-
-
Workflow loop depicting writing R code and revising with linters.
-
-
Software linting, or static analysis, is one way to ensure a minimum level of code quality without writing new tests. Linting checks how your code is structured without running it to make sure it abides by common language paradigms and logical structures. Using linting tools allows a developer to gain quick insights about their code before it is viewed or used by others.
-
-
One way to lint your R code is by using the lintr package. The lintr package is also complementary to the styler package, which formats the syntax of R code in a consistent way. Both of these can be used independently or as part of continuous quality checks for R code repositories.
-
-
Automated Linting Checks with R
-
-
-flowchart LR
- subgraph development
- write
- check
- end
- subgraph linters
- direction LR
- lintr
- styler
- end
- check <-.- linters
- write[Write R code] --> |check| check[Check code with pre-commit]
- check --> |revise| write
-
-
-
Workflow showing development with pre-commit using multiple linters.
-
-
lintr and styler can be incorporated into automated checks to help make sure linting (or other steps) are always used with new code. One tool which can help with this is pre-commit, which acts as both a local development tool in addition to providing observability within source control (more on this later).
-
-
Using pre-commit locally enables quick feedback loops using one or many checkers (such as lintr, styler, or others). Pre-commit may be used through the use of git hooks or manually using pre-commit run ... from a command-line. See this example of pre-commit checks with R for an example of multiple pre-commit checks for R code.
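As a sketch, a .pre-commit-config.yaml using the R-focused hooks from the lorenzwalthert/precommit project might look like the following (the rev is illustrative; pin a release you have verified):

```yaml
# example .pre-commit-config.yaml for R linting (rev is illustrative)
repos:
  - repo: https://github.com/lorenzwalthert/precommit
    rev: v0.3.2
    hooks:
      - id: style-files # applies styler formatting
      - id: lintr       # runs lintr checks
```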
-
-
Continuous and Observable Testing for R
-
-
-flowchart LR
- subgraph development [local development]
- direction LR
- write
- check
- commit
- end
- subgraph remote[Github repository]
- direction LR
- action["Check code (remotely)"]
- end
- write[Write R code] --> |check| check[Check code with pre-commit]
- check --> |revise| write
- check --> commit[commit + push]
- commit --> |optional trigger| action
- check -.-> |perform same checks| action
-
-
-
Workflow showing pre-commit used as continuous testing tool with Github.
-
-
Pre-commit linting checks can also be incorporated into continuous testing performed on your repository. One way to do this is using Github Actions. Github Actions provides a programmatic way to specify automatic steps taken as changes occur to a repository.
-
-
Pre-commit provides an example Github Action which will automatically check and alert repository maintainers when code issues are detected. Using pre-commit in this way allows R developers to ensure lintr checks are performed on any new work checked into a repository. This can have benefits towards decreasing pull request (PR) review time and standardizing how code collaboration takes place for R developers.
-
-
Resources
-
-
Please see the following the resources on this topic.
Tip of the Week: Timebox Your Software Work (2023-01-17)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Programming often involves long periods of problem solving which can sometimes lead to unproductive or exhausting outcomes. This article covers one way to avoid less productive time expense or protect yourself from overexhaustion through a technique called “timeboxing” (also sometimes referenced as “timeblocking”).
-
-
-
-
TLDR (too long, didn’t read);
-Use timeboxing techniques such as Pomodoro® or 52/17 to help modularize your software work to ensure you don’t fall victim to Parkinson’s Law. Timeboxing may also map well to Github Issues, which allows your software tasks to be further aligned, documented, and chunked in collaboration with others.
-
-
Controlling Work Time Expansion
-
-
-
-
Have you ever spent more time than you thought you would on a task? An adage which helps explain this phenomenon is Parkinson’s Law:
-
-
-
“… work expands so as to fill the time available for its completion.”
-
-
-
The practice of writing software is not protected from this “law”. It may be affected by it in sometimes worse ways during long periods of uninterrupted programming where we may have an inclination to forget productive goals.
-
-
One way to address this is through the use of timeboxing techniques. Timeboxing sets a fixed limit to the amount of time one may spend on a specific activity. One can use timeboxing to systematically address many tasks, for example, as with the Pomodoro® Technique (developed by Francesco Cirillo) or the 52/17 rule. While there are many ways to apply timeboxing, make sure to balance activity with short breaks and focus switches to help ensure we don’t become overwhelmed.
-
-
Timeboxing Means Modularization
-
-
Timeboxing has an auxiliary benefit of framing your work as objective and oftentimes smaller chunks (we have to know what we’re timeboxing in order to use this technique). Creating distinct chunks of work applies for both our daily time schedule as well as code itself. This concept is more broadly called “modularization” and helps to distinguish large portions of work (whether in real life or in code) as smaller and more maintainable chunks.
-
-
-
-
-
-
# Goals
-- Finish writing paper
-
-
-
-
-
-
-
Vague and possibly large task
-
-
-
-
-
-
# Goals
-- Finish writing paper
- - Create paper outline
- - Finish writing introduction
- - Check for dead hyperlinks
- - Request internal review
-
-
-
Modular and more understandable tasks
-
-
-
-
-
Breaking down large amounts of work as smaller chunks within our code helps to ensure long-term maintainability and understandability. Similarly, keeping our tasks small can help ensure our goals are achievable and understandable (to ourselves or others). Without this modularity, tasks can be impossible to achieve (subjective in nature) or very difficult to understand. Stated differently, taking many small steps can lead to a big change in an organized, oftentimes less exhausting way (related graphic).
List of example version control repository issues with associated time duration.
-
-
The parallels between the time we give a task and related code can work towards your benefit. For example, Github Issues can be created to outline a timeboxed task which relates to a distinct chunk of code to be created, updated, or fixed. Once development tasks have been outlined as issues, a developer can use timeboxing to help organize how much time to allocate on each issue.
-
-
Using Github Issues in this way provides a way to observe task progress associated with one or many repositories. It also increases collaborative opportunities for task sizing and description. For example, if a task looks too large to complete in a reasonable amount of time, developers may work together to break the task down into smaller modules of work.
-
-
Be Kind to Yourself: Take Breaks
-
-
While timeboxing is often a conversation about how to be more productive, it’s also worth remembering: take breaks to be kind to yourself and more effective. Some studies and thought leadership have shown that taking breaks may be necessary to avoid performance decreases and impacts to your health. There’s also some indication that taking breaks may lead to better work. See below for just a few examples:
Tip of the Week: Linting Documentation as Code (2023-01-03)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Software documentation is sometimes treated as a less important or secondary aspect of software development. Treating documentation as code allows developers to version control the shared understanding and knowledge surrounding a project. Leveraging this paradigm also enables the use of tools and patterns which have been used to strengthen code maintenance. This article covers one such pattern: linting, or static analysis, for documentation treated like code.
-
-
-
-
TLDR (too long, didn’t read);
-There are many linting tools available which enable quick revision of your documentation. Try using codespell for spelling corrections, mdformat for markdown file formatting corrections, and vale for more complex editorial style or natural language assessment within your documentation.
-
-
Spelling Checks
-
-
-
-
-
-
<!--- readme.md --->
-## Example Readme
-
-Thsi project is a wokr in progess.
-Code will be updated by the team very often.
-
-(CU Anschutz)[https://www.cuanschutz.edu/]
-
Example showing codespell detection of misspelled words
-
-
-
-
-
Spelling checks may be used to automatically detect incorrect spellings of words within your documentation (and code!). Codespell is one library which can lint your word spelling. Codespell may be used through the command-line and also through a pre-commit hook.
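As an illustration, running codespell from the command line against the readme above might produce output along these lines (exact formatting varies by version):

```
% codespell readme.md
readme.md:4: Thsi ==> This
readme.md:4: wokr ==> work
readme.md:4: progess ==> progress
```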
-
-
Markdown Format Linting
-
-
-
-
-
-
<!--- readme.md --->
-## Example Readme
-
-This project is a work in progress.
-Code will be updated by the team very often.
-
-(CU Anschutz)[https://www.cuanschutz.edu/]
-
-
-
Example readme.md with markdown issues
-
-
-
-
-
% markdownlint readme.md
-readme.md:2 MD041/first-line-heading/first-line-h1
-First line in a file should be a top-level heading
-[Context: "## Example Readme"]
-readme.md:6:5 MD011/no-reversed-links Reversed link
-syntax [(link)[https://www.cuanschutz.edu/]]
-
-
-
-
Example showing markdownlint detection of issues
-
-
-
-
-
The format of your documentation files may also be linted for common issues. This may catch things which are otherwise hard to see when editing content. It may also improve the overall web accessibility of your content, for example, through proper HTML header order and image alternate text. Markdownlint is one library which can be used to find issues within markdown files.
-
-
Additional and similar resources to explore in this area:
<!--- readme.md --->
-# Example Readme
-
-This project is a work in progress.
-Code will be updated by the team very often.
-
-[CU Anschutz](https://www.cuanschutz.edu/)
-
-
-
Example readme.md with questionable editorial style
-
-
-
-
-
% vale readme-example.md
-readme-example.md
-2:12 error Did you really mean 'Readme'? Vale.Spelling
-5:11 warning 'be updated' may be passive write-good.Passive
- voice. Use active voice if you
- can.
-5:34 warning 'very' is a weasel word! write-good.Weasel
-
-
-
Example showing vale warnings and errors
-
-
-
-
-
Maintaining consistent editorial style and grammar may also be a focus within your documentation. These issues are sometimes more difficult to detect and more opinionated in nature. In some cases, organizations publish guides on this topic (see the Microsoft Writing Style Guide or the Google Developer Documentation Style Guide). Some of the complexity of writing style may be linted through tools like Vale. Using common configurations through Vale can unify how language is used within your documentation by linting for common style and grammar.
-
-
Additional and similar resources to explore in this area:
-
-
-
textlint - similar to Vale with a modular approach
-
-
-
Resources
-
-
Please see the following the resources on this topic.
-
-
-
codespell - a code and documentation spell checker.
We support the labs and individuals within the Department by developing high quality web applications, web servers, data visualizations, data pipelines, and much more.
Dave Bunten is a multiskilled research data engineer with a passion for expanding human potential through software design, collaboration, and innovation.
-He brings a diverse background in higher education, healthcare, and software development to help orchestrate scientific data pipelines.
-Outside of work, Dave enjoys hiking, biking, painting, and spending time with family.
Faisal has been working as a full-stack developer for the past fifteen years. He was the lead developer on svip.ch (the Swiss Variant Interpretation Platform), a variant database with a curation interface. He has also worked with the BRCA Challenge on BRCA Exchange as a mobile, web, and backend/pipeline developer.
-
-
Since starting at the University of Colorado Anschutz in July 2021, he has been primarily engaged in porting applications to Google Cloud, including profiling apps for their resource requirements, writing IaC descriptions of the application stacks, and adding instrumentation.
Vince is a staff frontend developer in the Department.
-His job is to take the studies, projects, and ideas of his colleagues and turn them into beautiful, dynamic, fully-realized web applications.
-His work includes app development, website development, UI/UX design, logo design, and anything else visual or creative.
-Outside of the lab, Vince is a freelance music composer for indie video games and the YouTube channel 3Blue1Brown.
A web app that enables researchers to run a general-purpose computational workflow for
-characterizing the molecular evolution and phylogeny of their proteins of interest.
A frontend web app and supporting backend server that allows users to explore how a word changes in meaning over time based on natural language processing machine learning.
A migration of all the Monarch Initiative backend and associated services from physical hardware to Google Cloud, including automated provisioning and deployment via Terraform, Ansible, and Docker Swarm.
Automates the parsing and transformation of a KGX archive into graphdb-ready formats. After the archive is converted, automates the provisioning and deployment of Neo4j and Blazegraph instances from the converted archive.