
Hello, y’all. I am Samesh Lakhotia. I have fair experience contributing to open-source projects, and I would like to share it so that more people from our campus start contributing and get involved in the open-source community. I am currently pursuing my Google Summer of Code 2020 (GSoC) project at HydraEcosystem.org.

There are already tons of articles on “How to contribute to open source”, and you can look at them for a general overview. Here, I want to share some of the personal hurdles I faced along my journey and how you can overcome them when you start contributing to open source projects. This article is mainly for people who are interested in contributing to open source and/or want to participate in open-source programs like Google Summer of Code, Outreachy, etc.

Something I would want you all to know beforehand

Contributing to open-source projects as a beginner will feel intimidating. There is no defined “syllabus” or “9 steps to becoming an open-source guru” guide. There are some general steps you can follow, but you can still feel a bit lost in the process because there’s a steep learning curve involved. That’s why the most important thing, by far, when you start contributing to open source is being ready to learn new things on the go. But I guarantee you, the learning you gain through it is invaluable as a software developer.

How do I start?

The first question I often get asked is how to go about starting to contribute to open source.

Prerequisites:

Well, you need some basic prerequisites before you can start contributing to open source:

  • Fair knowledge (by fair, I just mean not being a complete beginner) of any programming language of your choice. It helps if you are also comfortable with a framework/library in that language, mainly because that opens doors to contribute to more projects.
  • Fair experience working with a version control system like Git.

One of the biggest hurdles most people face is how to choose a project to start contributing to.

There are many approaches here, but I would like to share what worked for me.

In my opinion, the best projects to start contributing to are the big libraries/frameworks you have used in your language of choice. They follow good open-source practices and are actively maintained, so if you get stuck on an issue, the project maintainers will be there to help.
My language of choice was Python, and at that time I was exploring the data science and machine learning domain. Through that work, I had experience with some of the famous libraries in the domain, including pandas, numpy and matplotlib.
So, I started contributing to these projects.

One thing to keep in mind: start with some very simple issues. Your first contribution need not be the next big feature in the library. In fact, I would suggest starting (and that’s also how I started) with a non-code contribution like fixing typos or updating the documentation. Most projects also tag issues with labels such as ‘good-first-issue’ or ‘beginner’; you could start with those. These will get you more comfortable with the GitHub workflow. And the feeling of getting your first PR merged is absolutely ecstatic💙

What next?

So now you have a decent number of PRs merged in some projects, solving beginner issues. You also feel fairly confident using Git.

The next hurdle I faced was how to go from solving simple issues to more intermediate ones. This was by far the biggest challenge for me.

There is no exact answer I can give for this; the key is patience and perseverance. But I can give you some tips that could help:

  • [Absolutely essential] Learn to use a debugger in the language/framework you are comfortable in. A debugger will help you a ton when you are trying to fix a bug; using print statements to log things only helps up to a point. I was in the Python ecosystem, so I learnt pdb. Go ahead and learn whichever is most suitable for your ecosystem.
  • For solving intermediate issues, I would actually recommend switching to a smaller project. The problem with huge projects is that their code can get a little too complex at this stage; they are really good for getting comfortable with Git and open-source culture in general, though. So I switched projects and started working on some intermediate-ish issues in a not-so-big project. That project was still in its alpha stage, and I could understand most of the codebase.
  • Explain your problem to the project maintainers and where exactly you are stuck. They will always be there to help, and they are going to be your best friends in open source.
  • Don’t get disheartened if you are not able to solve an issue for a long time. The first issue I solved took me around a week. Stick with an issue for some time, but if you still can’t solve it, try your hand at another one😉.
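To make the debugger tip above concrete, here is a minimal sketch (the buggy average function is just a made-up example, not from any real project) of dropping Python’s built-in pdb into a script:

```python
def average(values):
    # Bug to investigate: this raises ZeroDivisionError for an empty list.
    total = sum(values)
    return total / len(values)

# To inspect a failure interactively, set a breakpoint just before the
# suspicious call and run the script (n = next line, s = step in, p x = print x):
# import pdb; pdb.set_trace()
print(average([2, 4, 6]))  # prints 4.0
```

Once you are comfortable stepping through code like this, hunting down a bug in an unfamiliar codebase becomes far less scary than reading it top to bottom.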

Now what?

If you have solved some intermediate issues, you will have gained a good understanding of open-source development culture, and learnt a ton in the process.

Now, for people interested in programs like Google Summer of Code: the process is largely the same, with a few extra steps. The core idea doesn’t change; contribute by fixing bugs and adding features.

One very important thing to understand throughout your journey is to continue only if you like the process. I personally loved the thrill I got whenever I went after a new issue or got a PR merged. The satisfaction of seeing my code being used by so many other people was what drove me.
There is no point in continuing just because you want a tag on your resume. I would rather recommend finding something you like doing and working hard at that.

I hope this article helps someone getting into open source. I would love to answer any queries you have in the comments.
If you have any other questions on open-source development / GSoC, feel free to contact me.

Do ping me once your first PR gets merged😉! Happy developing.

Hi there!

I’m Samesh Lakhotia, and this post aims to summarize my journey through Google Summer of Code 2020. It has been three months of intensive learning, divided into three coding phases, and a remarkable experience as a developer. This has been one of the most productive and enriching summers of my life.
For each phase I’ll cite its goal, a small description, and the core PRs attached to it. I also wrote detailed bi-weekly blog posts during each phase, linked at the end of each section. The last part lists some diverse auxiliary contributions merged along the way.

Our community has different tools: a working server that can be built from a Hydra Doc, a.k.a. hydrus; a Python agent to work with that server, a.k.a. hydra-python-agent; and the core library that handles all the major parsing of Hydra Docs, a.k.a. hydra-python-core. My GSoC colleague Priyanshu Nayan and I had a series of improvements to make; Priyanshu led the changes to the Agent and the core library, while I was responsible for the hydrus side.

[Phase 1] - hydrus database architecture improvements

The first phase of GSoC was marked by intensive study and discussions. I was in charge of optimising the existing database architecture of hydrus. The basic idea was to go from a generic database schema to one where different types of resources are stored in different tables, to improve scalability and efficiency. To start this task, I had to get really comfortable with the existing database architecture.

Then I had detailed discussions with the mentors on the possible architecture, proceeded to the coding, and lastly documented it. References for this phase are below:

Goal:

  • Research and implement a better multiple-table architecture: a database schema where different types of resources are stored in different tables, improving scalability and efficiency

Pull Requests:

Detailed blog posts during this phase:

[Phase 2] - Treating collections as a Resource

The second phase of GSoC consisted mainly of two new features and enhancements. These were old issues in hydrus that had needed fixing for a long time.
They included the ability to treat a Collection as a resource (and therefore create custom collection instances), and removing the unnecessary dependency on “*Collection” as the default notation for naming Collection resources.

Goal:

  • Improve hydrus by adding the ability to treat a Collection as a resource, and remove the dependency on “*Collection” as the default notation for naming Collection resources

Pull Requests:

Detailed blog posts during this phase:

[Phase 3] - More features in hydrus!

As the last phase started, we had covered a considerable amount of our core goals, and it was time to close things out with some last additions. We wanted some more features in hydrus, including removing the hardcoded dependency on the vocab keyword and adding support for multiple resource type collections.

Goal:

  • Improve hydrus with the main feature of multiple resource type collections, remove the hardcoded dependency on the vocab keyword, and wrap up bug fixes and documentation

Pull Requests:

Detailed blog posts during this phase:

Auxiliary Contributions

Some additional contributions were made along the way and during Community Bonding; they are listed here for completeness but were mainly auxiliary:

hydrus

hydra-python-core

documentation-hydrus

http-apis.github.io

Acknowledgments

The GSoC 2020 journey now comes to an end. I have to say it was a really important development phase for me personally and an experience I’ll remember forever. I feel I’ve progressed as a software developer, and that we were able to take Hydra Ecosystem development a little bit further with a group of people spread across the world.

I have also been added to the HydraEcosystem organisation. I will use this opportunity to keep contributing to HydraEcosystem [and other open source projects too ;)] even after my GSoC 2020 period. I have talked with my mentors about the other interesting things they are working on in the organisation and how I could start working on them.

And of course, none of this would have been possible without the support, time and will of my amazing mentors Chris Andrew, Akshay Dahiya and Lorenzo, who have been cultivating the Hydra Ecosystem not only this summer but for years, giving their time and effort to develop this concept and provide the opportunity for other people to join and learn. My honest thanks for the knowledge and trust you’ve shared with me; it’s something I’ll take with me in my career to become a better professional.
I also have to give a huge shout out to my GSoC partner Priyanshu Nayan, who helped me a lot whenever I was stuck debugging something, had a doubt about the Hydra spec, or needed any help in general.

To sum it up, I think this is the start of my journey in open source software rather than the end of my GSoC period.

Get in touch

Thanks for stopping by, I hope you found something useful or interesting.
If you want to contact me, you can reach out to me via my below online profiles:

Linkedin
Github
Twitter
Email me at samesh.lakhotia+work@gmail.com

This was the last week of my GSoC 2020. The main focus for this week was completing any leftover work from previous phases, removing bugs and adding documentation; basically, getting our work polished and ready for the next release of hydrus.

Multiple resource type collections

The majority of my time in the last two weeks was spent adding support for multiple resource type collections.
After the PR for treating Collection as a resource got merged, hydrus supported treating collections as resources.
But in that implementation, collections were restricted to a single class type.

Also, the spec describes collections as a ‘set of somehow related resources’. This means a collection might also be a set of instances of different classes.

We needed to add support for this type of collection too.

The support for this feature was added in this PR.

Example usage after changes made

1) PUT on /collection/
The request body should have the list of @ids of the class instances to be grouped into a collection.
NOTE: These instances should already exist in the database before grouping them into a collection.
For eg,

    {
        "@type": "LogEntryCollection",
        "members": [
            {
                "@id": "/serverapi/Drone/4ff14a9e-9cd0-4e8a-9c11-86ac9bec211f",
                "@type": "Drone"
            },
            {
                "@id": "/serverapi/LogEntry/aab38f9d-516a-4bb2-ae16-068c0c5345bd",
                "@type": "LogEntry"
            }
        ]
    }

In the above example, a LogEntry instance with id (primary key) ‘aab38f9d-516a-4bb2-ae16-068c0c5345bd’ and a Drone instance with id (primary key) ‘4ff14a9e-9cd0-4e8a-9c11-86ac9bec211f’ already exist in the database.
This adds a row to the ‘LogEntryCollection’ table.

Example response:
    {
        "@context": "http://www.w3.org/ns/hydra/context.jsonld",
        "@type": "Status",
        "description": "Object with ID 50ed3b68-9437-4c68-93d4-b67013b9d412 successfully added",
        "statusCode": 201,
        "title": "Object successfully added."
    }

2) GET on /collection/id

Example endpoint: /LogEntryCollection/50ed3b68-9437-4c68-93d4-b67013b9d412
This will return all the members belonging to the collection with the given collection_id. The members returned are all hydra:Links.

Example response:

    {
        "@context": "/serverapi/contexts/LogEntryCollection.jsonld",
        "@id": "/serverapi/LogEntryCollection/50ed3b68-9437-4c68-93d4-b67013b9d412",
        "@type": "LogEntryCollection",
        "members": [
            {
                "@id": "/serverapi/LogEntry/aab38f9d-516a-4bb2-ae16-068c0c5345bd",
                "@type": "hydra:Link"
            },
            {
                "@id": "/serverapi/Drone/4ff14a9e-9cd0-4e8a-9c11-86ac9bec211f",
                "@type": "hydra:Link"
            }
        ]
    }

Other work

Other than working on the above feature, I spent some time fixing a minor bug in hydrus.
The PR for this fix is here.

I also added some much-needed documentation on utilizing the new endpoints for managing collections, and updated old docs where necessary.
The PR for the same is here.

Learnings / Work done

I became more adept at using the tools in the HydraEcosystem.

My PRs in this period:

  • Add support for multiple resource type collections - PR #493
  • Fix docker bug in hydrus - PR #496
  • Documentation for hydrus - PR #82

Next Steps

As this week marks the culmination of my Google Summer of Code 2020 journey, it was quite emotional for me.
I had one of the best experiences of my life contributing to HydraEcosystem.org during this period. I learnt a lot in the last four months of GSoC, from working as a team to collaborating on open source software.

I have also been added to the HydraEcosystem organisation. I will use this opportunity to keep contributing to HydraEcosystem [and other open source projects too ;)] even after my GSoC 2020 period. I have talked with my mentors about the other interesting things they are working on in the organisation and how I could start working on them.

To sum it up, I think this is the start of my journey in open source software rather than the end of my GSoC period.

HydraEcosystem logo

Happy contributing, y’all!

I passed the second evaluation in GSoC 2020. Hurray!

Also, now 10 weeks have been completed. We are in the endgame now.

Now, coming to what I have been working on: the last two weeks were mainly spent tackling a very big pending issue in hydrus, the ‘hydrus hardcoded with vocab: to find relationships’ issue.

The Problem

At the moment, hydrus is hardcoded with vocab: when it tries to identify foreign key relationships. For eg, here.

But the problem here is that just expanding vocab: won’t solve the underlying issue of discovering all foreign key relationships correctly. For example, we could just as well have something called myownvocab: in the @context node of the Hydra API doc, which could then be used to reference something like myownvocab:Class1.

We might still extract this correctly with some work while parsing the API doc, but if the context becomes nested, it becomes difficult to parse.

This was a really big issue to tackle, not just because of the implementation in hydrus, but also because a lot of changes were required in the hydra-python-core repo. That is because the hardcoded nature of hydrus was due to the fact that initially hydrus and hydra-python-core were very tightly coupled.

Solution

The ideal behaviour is that hydrus starts with an expanded API doc; we should not parse the API doc without expanding it beforehand. This change should ideally take place in the hydra_python_core library, which generates the API doc object that hydrus uses for parsing the data it needs from the API doc. Therefore the output of hydra_python_core’s doc_maker module should be an API doc that has been expanded beforehand.

This dependency on hydra-python-core made the issue a little more difficult than I initially anticipated.

Other work

Other than working on the above issue, I spent some time adding documentation for hydrus here. We all know how important good documentation is for any project, especially open source ones.

Also, some work needed to be done on the previous ‘Treating collections as a resource’ issue:
handling GET requests on any /collection endpoint, here. This was basically an update to the work in this PR.

Learnings / Work done

As most of the work done in this period depended on hydra-python-core, I really learnt how to work in a team, alongside my fellow GSoCer at HydraEcosystem, Priyanshu.

It was an amazing experience, as together we hunted down all the bugs to make sure hydrus and hydra-python-core work perfectly in sync.

My PRs in this period:

  • Remove hardcoded vocab: keyword - PR #490
  • Add docs on hydrus - PR #49
  • GET request on /collection endpoint - PR #492

Next Steps

After PR #488 got merged, hydrus now has support for treating collections as a resource.

But in that implementation, collections are restricted to a single class type.
As described in the spec, collections are ‘a set of somehow related resources’. This means a collection might also be a set of instances of different classes.

We need to add support for this type of collection too.

Eight weeks of my GSoC 2020 have elapsed. It feels great to be working on the project, as we have made a good amount of progress.

Also, the second evaluations are coming up. Fingers crossed.

Now, coming to what I have been working on: the last two weeks were spent tackling a very big pending issue in hydrus, the ‘Treating collections as a resource’ issue.

This was one of the biggest points in my proposal for GSoC 2020, and the mentors also wanted this issue fixed. A lot of improvements become possible once it is.

The Problem

Currently, hydrus treats the collection as a single resource for every class, which is wrong. Nowhere in the spec is it mentioned that a collection is the set of all objects of a given class. It says a collection is “a set of somehow related resources”, which does not imply that a collection is limited to a single class type.

What this essentially means is that currently in hydrus, if there is a ‘collection’ class which is a collection of instances of type x, it is just being used to store all elements of type x.

For example, looking at this Drone ApiDoc, we can see there is a hydra:Collection titled ‘DroneCollection’, and there is also a hydra:Class of type ‘Drone’. Currently, whenever you add a new drone to the database (via a PUT request), it gets added to the DroneCollection collection. Therefore, all drones belong under just DroneCollection, meaning a GET request at /DroneCollection returns the members property with all the drones in the database.
But nowhere in the Hydra spec is it written that a hydra:Collection is just the set of all members of that class (in this case, drones).

Solution

The way to go is to allow users to define collections on their own. Each collection would have its own @id, since it is a subclass of hydra:Resource. We then use the collection endpoint to relate classes/properties.

For example, seeing this in the context of the Comment and Issue classes:
The advantage of using a collection is that we can define a CommentCollection object that links all the Comment objects for a particular issue, and then give that CommentCollection object as a property of the Issue.

In terms of defining it in the API documentation, it would be something like:

    {
        "@type": "SupportedProperty",
        "property": "vocab:CommentCollection",
        "readonly": "false",
        "required": "true",
        "title": "Comments",
        "writeonly": "false"
    }

The property would map to an instance of the CommentCollection class. This will be defined in the supportedProperty field of the Issue class definition. Like any other object, when we define an Issue object, we will add the appropriate CommentCollection instance to the Issue instance. So an object would look something like:

    {
        "@type": "Issue",
        "@id": "/api/Issues/27",
        "Issue": "....",
        "Comments": "/api/CommentCollection/32",
        ....
    }

And then /api/CommentCollection/32 would have:

    {
        "@type": "CommentCollection",
        "@id": "/api/CommentCollection/32",
        "members": [
            {
                "@id": "/api/Comment/12",
                "@type": "Comment"
            },
            {
                "@id": "/api/Comment/18",
                "@type": "Comment"
            },
            {
                "@id": "/api/Comment/22",
                "@type": "Comment"
            }
        ]
    }

Implementation details

We have planned to add a new collection_id column to every collection table, to distinguish which instance belongs to which collection. We need a new column because its values can repeat: a collection can have many items in it.

Apart from this column, the existing columns are id, which acts as the primary key for the table, and members, which acts as a link to the actual item in that item’s table.
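As a rough sketch of that layout (hydrus itself uses SQLAlchemy; this stdlib sqlite3 version with made-up member ids is only meant to illustrate the columns described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CommandCollection (
        id            INTEGER PRIMARY KEY,  -- primary key of the table
        collection_id TEXT,                 -- repeats for every member of one collection
        members       TEXT                  -- link to the item's row in its own table
    )
""")

# Group two existing Command instances into one collection:
cid = "50ed3b68-9437-4c68-93d4-b67013b9d412"
conn.executemany(
    "INSERT INTO CommandCollection (collection_id, members) VALUES (?, ?)",
    [(cid, "aaaaa"), (cid, "bbbbb")],
)

# Fetching by collection_id returns every member of that collection:
members = [row[0] for row in conn.execute(
    "SELECT members FROM CommandCollection WHERE collection_id = ? ORDER BY id",
    (cid,),
)]
print(members)  # ['aaaaa', 'bbbbb']
```

Because collection_id is not unique, one table can hold many collections side by side, which is exactly what the new architecture needs.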

Brief description of major changes after we start treating collection as a resource:

(Note: in the discussion below, CommandCollection is a collection class and Command is a non-collection / parsed class, i.e. any class which does not act as a ‘collection’ of items.)

1) GET on /collection/

Example endpoint: /CommandCollection/

This fetches the data from the ‘CommandCollection’ table.

Will return a list of collections.

Example response:

    {
        "@context": "/serverapi/contexts/CommandCollection.jsonld",
        "@id": "/serverapi/CommandCollection/",
        "@type": "CommandCollection",
        "hydra:totalItems": 3,
        "hydra:view": {
            "@id": "/serverapi/CommandCollection?page=1",
            "@type": "hydra:PartialCollectionView",
            "hydra:first": "/serverapi/CommandCollection?page=1",
            "hydra:last": "/serverapi/CommandCollection?page=1"
        },
        "members": [
            {
                "@id": "/serverapi/CommandCollection/7d2cc88f-388a-43f1-80fc-0c2184de4784",
                "@type": "CommandCollection"
            },
            {
                "@id": "/serverapi/CommandCollection/50ed3b68-9437-4c68-93d4-b67013b9d412",
                "@type": "CommandCollection"
            },
            {
                "@id": "/serverapi/CommandCollection/c2e10f88-d205-41e7-aa61-338afd64f657",
                "@type": "CommandCollection"
            }
        ]
    }

2) PUT on /collection/

Example endpoint: /CommandCollection/

The request body should have the list of ids of the class instances to be grouped into a collection.
For eg,

    {
        "@type": "CommandCollection",
        "members": ["aaaaa", "bbbbb", "ccccc"]
    }

In the above example, ‘aaaaa’, ‘bbbbb’ and ‘ccccc’ are the ids (primary keys) of instances in the Command table.
This adds data to the ‘CommandCollection’ table.

Example response:
    {
        "@context": "http://www.w3.org/ns/hydra/context.jsonld",
        "@type": "Status",
        "description": "Object with ID 50ed3b68-9437-4c68-93d4-b67013b9d412 successfully added",
        "statusCode": 201,
        "title": "Object successfully added."
    }

NOTE: The id returned is actually the value of the collection_id column in the table, not the primary key id.

3) GET on /collection/id

Example endpoint: /CommandCollection/50ed3b68-9437-4c68-93d4-b67013b9d412

Note: The id is corresponding to the collection_id column in the CommandCollection table, not the primary key id.

Will return all the members belonging to the collection with the given collection_id.

Example response:

    {
        "@context": "/serverapi/contexts/CommandCollection.jsonld",
        "@id": "/serverapi/CommandCollection/50ed3b68-9437-4c68-93d4-b67013b9d412",
        "@type": "CommandCollection",
        "members": [
            {
                "@id": "/serverapi/Command/aaaaa",
                "@type": "Command"
            },
            {
                "@id": "/serverapi/Command/bbbbb",
                "@type": "Command"
            },
            {
                "@id": "/serverapi/Command/ccccc",
                "@type": "Command"
            }
        ]
    }

4) POST on /collection/id

Example endpoint: /CommandCollection/50ed3b68-9437-4c68-93d4-b67013b9d412

The request body should have the list of members to be updated for the given collection.
For eg,

    {
        "@type": "CommandCollection",
        "members": ["ddd", "eee", "fff"]
    }

Note: The id is corresponding to the collection_id column in the CommandCollection table, not the primary key id.

Will update all the members belonging to the collection with the given collection_id with the members given in the request body.

Example response:

    {
        "@context": "http://www.w3.org/ns/hydra/context.jsonld",
        "@type": "Status",
        "description": "Object with ID 50ed3b68-9437-4c68-93d4-b67013b9d412 successfully updated",
        "statusCode": 200,
        "title": "Object updated"
    }

5) DELETE on /collection/id

Example endpoint: /CommandCollection/50ed3b68-9437-4c68-93d4-b67013b9d412

Note: The id is corresponding to the collection_id column in the CommandCollection table, not the primary key id.

Will delete that collection from the table.

Example response:

    {
        "@context": "http://www.w3.org/ns/hydra/context.jsonld",
        "@type": "Status",
        "description": "Object with ID 50ed3b68-9437-4c68-93d4-b67013b9d412 successfully deleted",
        "statusCode": 200,
        "title": "Object successfully deleted."
    }

6) GET, PUT, POST and DELETE on any /non-collection-class

Example endpoint: /Command/ or /Area

NOTE: All of these have the same behaviour that the /CommandCollection/ endpoint had before these changes.

Learnings / Work done

Apart from learning a lot more of the hydrus codebase, I learnt a lot about the Hydra spec.

All my work for this issue has been completed in this PR.

So, six weeks of my GSoC 2020 have elapsed. It feels great to be working on the project, as we have made a good amount of progress.

Also, the first evaluations have been completed and I passed them successfully. That’s good!

Now, coming to what I have been working on in the last two weeks: initially the plan was to start on the ‘Treating collections as a resource’ issue in hydrus.

But after discussion, my mentors suggested I work on two smaller issues before tackling that one, as it is a big issue.

The problems

1. Use of “*Collection” as default notation for Collection resources

In a lot of places, hydrus assumes that the collection for a certain resource will be named “[Classname]Collection”. This has been hardcoded in many places in the code as well; for example, here.

Ideally, a Collection endpoint and resource can use any name, and we should not limit them to the current format.

Solution

Collections have to be uniquely identifiable from the ApiDoc. Also, the parsing of resources from the API doc is actually taken care of by the hydra-python-core module, so the real problem lies in that module rather than hydrus.

After going through the Hydra spec, we understood that collections in an ApiDoc can be identified in two ways:

  • All collections have to have an inline ‘manages’ block. The manages block is a way to add information about the members of that collection.
  • All collections can be provided in the hydra:collection property of the EntryPoint.

Both of these cases can be understood from this example.

The work I have done is in this PR: https://github.com/HTTP-APIs/hydra-python-core/pull/41

There was also complementary work done in this PR by fellow GSoC participant Priyanshu.

2. hydrus hardcoded with ‘vocab:’ to check foreign key relationship

At the moment, hydrus is hardcoded with vocab: when it tries to identify foreign key relationships. For eg, here.

But we should not hardcode the method of detecting such relationships.
In hydrus, we should actually expand the vocab: keyword and check the expanded form when identifying foreign key relationships; for eg, something like https://www.markus-lanthaler.com/hydra/event-api/vocab#Class1.
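As a toy illustration of what expanding a prefixed term means (expand_term is a hypothetical helper written for this post; the real mechanism is full JSON-LD expansion, handled in hydra-python-core):

```python
def expand_term(term, context):
    """Expand a prefixed term like 'vocab:Class1' using the @context mapping."""
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return term  # no known prefix: leave the term unchanged

context = {"vocab": "https://www.markus-lanthaler.com/hydra/event-api/vocab#"}
print(expand_term("vocab:Class1", context))
# https://www.markus-lanthaler.com/hydra/event-api/vocab#Class1
```

Once every term is in its expanded IRI form, comparisons no longer depend on which prefix an API doc happens to use.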

But the problem here is that just expanding vocab: won’t solve the underlying issue of discovering all foreign key relationships correctly. For example, we could just as well have something called myownvocab: in the @context node of the Hydra API doc, which could then be used to reference something like myownvocab:Class1.

We might still extract this correctly with some work while parsing the API doc, but if the context becomes nested, it becomes difficult to parse.

This issue is actually directly related to this existing issue in the hydra-python-core library.

Solution

The ideal behaviour is that hydrus starts with an expanded API doc; we should not parse the API doc without expanding it beforehand.

This change should ideally take place in the hydra_python_core library, which generates the API doc object that hydrus uses for parsing the data it needs from the API doc. Therefore the output of hydra_python_core’s doc_maker module should be an API doc that has been expanded beforehand.

For working on this, Priyanshu started with the changes in the core module in his PR.
I then started with doing necessary changes in hydrus after these changes in the core library.

Learnings

I learnt a lot more about the Hydra spec while solving the above two issues. Both needed a lot of background understanding of the spec so we could implement the required solutions in hydrus and, more importantly, the hydra-python-core module.

Hi y’all!

It’s been 4 weeks into the coding period of my GSoC 2020.

First off, I have some good news to share:

The task of creating a database schema where each resource has its own table, built by reading the ApiDoc at runtime, has been completed!

HydraEcosystem logo

Previous schema

Hydrus schema

New schema

As the new schema depends on the ApiDoc, the below schema is for this Drone ApiDoc.

New Drone schema

The ER diagram isn’t that clear because I didn’t draw it by hand; I generated it with a tool called ERAlchemy. I will update it with a better one ASAP.

I will explain the important bits here:

  • All the tables are for the resources defined in the Drone ApiDoc.
  • The lines between them are foreign key constraints. These were established by checking whether any supportedProperty of a resource references another resource: either directly through the property attribute of that supportedProperty (for eg, ‘vocab:State’), or by defining a hydra:Link in the property attribute, in which case the range of that hydra:Link dictates the resource the foreign key points to (for eg, ‘range’: ‘vocab:State’).
  • The State column in the Command table acts as a foreign key to the State table.
  • The DroneState column in the Drone table acts as a foreign key to the State table.
  • The State, Data and Command columns in the LogEntry table act as foreign keys to the State, Datastream and Command tables respectively.
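A rough sketch of the two detection cases described above, with a hypothetical helper name and a simplified supportedProperty structure (the real parsing lives in hydra-python-core and handles more cases):

```python
# Hypothetical sketch: derive a foreign-key target from one
# supportedProperty entry, covering the two cases described above:
# (1) a direct "vocab:X" reference in the property attribute, or
# (2) a hydra:Link whose range is "vocab:X".

def foreign_key_target(supported_property):
    """Return the referenced resource name, or None if there is no reference."""
    prop = supported_property.get("property")
    if isinstance(prop, str) and prop.startswith("vocab:"):
        return prop.split(":", 1)[1]          # "vocab:State" -> "State"
    if isinstance(prop, dict) and prop.get("@type") == "hydra:Link":
        rng = prop.get("range", "")
        if rng.startswith("vocab:"):
            return rng.split(":", 1)[1]       # range "vocab:State" -> "State"
    return None

# Case 1: direct reference through the property attribute
print(foreign_key_target({"property": "vocab:State"}))
# Case 2: hydra:Link with a range
print(foreign_key_target(
    {"property": {"@type": "hydra:Link", "range": "vocab:State"}}))
```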

Benefits of this new database architecture

  • As this database architecture makes a table for each resource, it really improves scalability and efficiency.
  • The database operations are much more efficient now.
  • The codebase (the CRUD-operations part) has become less complex, as all the linking for RDF triples is now done by the database tables themselves. Before, the schema was really generic, which meant overhead on the developer’s side to implement all the CRUD operations. Now that the database has been simplified to a table per resource, that overhead is taken care of by SQLAlchemy, and we can get away with a very simple statement of the kind table.insert(value) (thank you, SQLAlchemy).
  • Fun fact: if you look at my PR for this feature, you will notice that the net diff actually removes code from the codebase, even after the implementation of such a big feature! This goes to show that this feature, while bringing more optimisation to hydrus, in fact reduces its complexity.

Learnings

It took me a little more time than I anticipated to implement this feature, as the previous database-operations code was really tightly coupled to the existing schema. There was not much abstraction between the under-the-hood database queries and the higher-level logic for parsing data.

This taught me how to write better-decoupled code so that the project stays maintainable in the long run.

Also, after completing this feature, I have become confident about ~97-99% of the hydrus codebase. During the implementation I hit a lot of bugs, which led to many pdb sessions in which I navigated deep stack traces and essentially went over a major chunk of the hydrus codebase, line by line.

I also faced many bugs along the way, and after spending some time debugging, Python’s pass-by-object-reference semantics hit me right in the face (facepalm!).
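This is not the actual hydrus bug, just an illustration of the gotcha: Python passes references to objects, so a function that mutates an argument silently mutates the caller’s object too. The function and variable names below are made up for the example.

```python
# Illustration of the pass-by-object-reference gotcha (names are
# hypothetical): mutating an argument in place also mutates the
# caller's object, which can surface as a bug far from its cause.

def add_column(columns, name):
    columns.append(name)          # mutates the caller's list in place!
    return columns

base_columns = ["id"]
drone_columns = add_column(base_columns, "DroneState")

print(base_columns)
# ['id', 'DroneState']  -- the "base" list changed too. Surprise!

# A safe version builds a new list instead of mutating the argument:
def add_column_safe(columns, name):
    return columns + [name]
```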

All the work I have done for this feature is in this PR.

Path Ahead

The Phase 1 evaluations are coming up tomorrow. Fingers crossed.

Also, the next problem we would like to tackle is treating collections as resources in hydrus. Looking forward to it!

So, the first two weeks of my GSoC have been completed.

My project is basically on improving the Hydra server in Python called hydrus.
My first task was to improve the existing database architecture that hydrus used internally while serving any ApiDoc.
This task was related to this issue.

The existing schema:

hydrus schema

This design for the database was really generic.

Our aim was to have a multi-table database architecture where different resources are stored in different tables to improve scalability and efficiency. The result we wanted to achieve was that there should be a table for each hydra:Class and hydra:Collection defined in the ApiDoc.

For example, if we consider the Drone ApiDoc, we would want a State table, a Drone table and so on. The columns would be the properties defined under supportedProperty for that class or collection.
This would greatly simplify operations on the data: for any operation, such as updating a record or deleting an old one, we would just need to call the methods provided by the Flask-SQLAlchemy library. A lot of the overhead we had in our previous database schema would be minimised.

For example, in the previous database schema, a simple operation such as adding a Message instance (by doing a PUT request to /MessageCollection) required an INSERT operation in 4 tables, namely the instance, terminals, graphiit and graph tables. That is not very efficient.

In the new schema, there will be a separate table for storing all instances of the Message class. This table will have a primary-key column and a column named MessageString (from the supportedProperty of the Message class). So, to modify any instance of the Message class, we only need a single operation on this one table.

This will make our database operations much more efficient.
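The per-resource layout can be sketched with the standard-library sqlite3 module (hydrus itself uses Flask-SQLAlchemy, and the schema is generated from the ApiDoc rather than hand-written; the table and column names below follow the Message example above):

```python
# Sketch of the new per-resource layout: one table per class, so adding
# a Message instance is a single INSERT instead of writes to 4 tables.
# Uses stdlib sqlite3 for illustration; hydrus uses Flask-SQLAlchemy.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Message ("
    "  id INTEGER PRIMARY KEY,"
    "  MessageString TEXT"   # column from Message's supportedProperty
    ")"
)

# A PUT to /MessageCollection boils down to one operation on one table
conn.execute("INSERT INTO Message (MessageString) VALUES (?)",
             ("Drone 1 is online",))
conn.commit()

rows = conn.execute("SELECT MessageString FROM Message").fetchall()
print(rows)
# [('Drone 1 is online',)]
```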

To achieve this feat, I had to read the ApiDoc and then make the database specific to that ApiDoc.

I spent the first week reading about hydrus’s existing schema and understanding how we store the data. This made it clearer to me how I could optimise the database architecture.

In the second week, I started implementing the logic for parsing the ApiDoc and creating the tables from it. I have completed creating the tables “dynamically”.
Next week, I will look into how we can get ‘linking’ behaviour in the tables, i.e. how to connect tables with foreign keys just by parsing the ApiDoc.

The work I have done is in this PR: https://github.com/HTTP-APIs/hydrus/pull/479

Learnings

I learnt a lot on how we go about database optimisation.

I also learnt how to “dynamically” create tables in a database using the Flask-SQLAlchemy library. The challenge was mainly to create Python classes at runtime, because all operations on the tables are done through SQLAlchemy classes.

Surprisingly, Python’s unassuming type function was the backbone of creating classes at runtime.
This blog post also came in very handy.
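The mechanism is the three-argument form of the built-in `type(name, bases, namespace)`, which builds a class at runtime. The sketch below shows it with a plain `object` base and illustrative column names; in the Flask-SQLAlchemy setting one would pass the declarative model as the base and `Column` objects in the namespace, so this is a simplification, not the hydrus code.

```python
# The three-argument form of type() builds a class at runtime:
# type(name, bases, namespace). This is the mechanism behind
# generating one class per resource from the ApiDoc.

def make_resource_class(name, columns):
    """Create a class with one (placeholder) attribute per column name."""
    namespace = {col: None for col in columns}
    return type(name, (object,), namespace)

# e.g. a class derived from the Drone ApiDoc (column names illustrative)
Message = make_resource_class("Message", ["id", "MessageString"])

m = Message()
m.MessageString = "Drone 1 is online"
print(Message.__name__, m.MessageString)
# Message Drone 1 is online
```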

The community bonding period is basically meant for students to get to know their mentors, read documentation and interact more with the organisation’s community; basically, to get up to speed before the coding period starts.

So, for almost the past month, I have been fixing minor bugs and writing documentation for the HydraEcosystem to get used to the tools that sit alongside hydrus in the ecosystem. This helped me learn a ton about the semantic web, JSON-LD, RDF, linked data and Hydra.

What we, at Hydra, are working on is to automate REST APIs. We are working on building better Web APIs and ‘smarter’ clients.

Automate? Better web APIs? Smarter clients?

Even today, building APIs is more of an art than a science. Today’s clients are heavily hard-coded against specific APIs. This makes the clients very brittle: when the API changes even slightly, the clients break.

Let’s take an example to understand this better.
Suppose you are using a football statistics API which gives stats for a particular team in the ongoing season. Right now it serves stats only for the teams playing in the English Premier League.
It serves the stats for Liverpool at http://myamazingfootballstats.com/liverpool. Now, as they expand, they have started serving data for the Spanish La Liga as well. So they have changed their API and now serve stats for Liverpool at http://myamazingfootballstats.com/premierleague/liverpool.
This simple change is going to break all the clients relying on this API for data, forcing us to spend a lot of time manually reconfiguring all our clients to use the new API.

This is where Hydra comes into the picture. Imagine if there were a way to document our API, including all the endpoints it serves, the operations allowed on those endpoints, the format of data required at those endpoints, and so on, in a form machines could understand. Imagine smart clients which could read that API documentation and decide for themselves how the data has to be retrieved, with no hard-coding. And imagine smart servers which, given just this documentation, know how to set themselves up and start serving data: no manual server setup such as configuring a database, because the server understands everything from the documentation.

Hydra is a specific type of JSON Linked Data representation that was proposed by the W3C community. You can think of it as a standard ‘vocabulary’ shared between clients and servers. The client can understand the state of the server, the available endpoints, the methods allowed on those endpoints and all the other necessary information from Hydra’s API documentation. This documentation is served at a URL, which the client can identify from the server’s response.

So, once the client is given the URL of the server, it fetches the API documentation. After that, it automagically finds, from the API documentation, all the information it needs to interact with the server, such as the endpoints the server is hosting.
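Per the Hydra spec, a server advertises its API documentation through an HTTP Link header with the relation `http://www.w3.org/ns/hydra/core#apiDocumentation`. A minimal sketch of how a client could extract that URL from such a header value (the header string and URL below are made up for the example, and real Link-header parsing has more edge cases):

```python
# Minimal sketch: extract the API documentation URL that a Hydra server
# advertises in its HTTP Link header. Header value here is illustrative.
import re

HYDRA_APIDOC_REL = "http://www.w3.org/ns/hydra/core#apiDocumentation"

def api_doc_url(link_header):
    """Return the API doc URL from a Link header value, or None."""
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>;\s*rel="([^"]+)"', part)
        if match and match.group(2) == HYDRA_APIDOC_REL:
            return match.group(1)
    return None

header = ('<http://www.myamazingfootballstats.com/api/vocab>; '
          'rel="http://www.w3.org/ns/hydra/core#apiDocumentation"')
print(api_doc_url(header))
# http://www.myamazingfootballstats.com/api/vocab
```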

Therefore, Hydra is a set of technologies for designing APIs in a different manner, in a way that enables smarter clients and more generic servers.

I am mainly going to work on the Hydra server, called hydrus, while my fellow GSoCer at Hydra, Priyanshu, is going to work on the client-side part of Hydra, the Hydra Agent.

My Work

My time in the community bonding period was mostly spent learning about all the tools in the Hydra Ecosystem. I learnt a lot about how Hydra API documentation works and how smart clients understand it. I also spent a lot of time reading the Hydra spec, which acts as the core vocabulary that makes all of this possible.

My PRs during this period, which include improving documentation, refactoring existing code and fixing minor bugs:

My mentors, Akshay and Chris, helped me a lot with my doubts. I also spent a lot of time discussing Hydra with Priyanshu.

Path ahead

During the first two weeks, I plan to research and implement a better database architecture for hydrus. Right now, the database architecture is very generic; we can make a lot of changes that will improve scalability and efficiency.


Hello everyone!
I am going to share my Google Summer of Code (GSoC) 2020 experience here.

More about me here