Big data is all the rage, and as companies scramble to gather as much data as they possibly can, the usual outcome is that you end up with more data than you know what to do with. In a lake filled with files and objects, it's very easy for data to get lost. This is where metadata comes into play. Put simply, metadata is additional information about your data. Some of it is generated automatically and some of it is entered by hand.
For example, in the case of a picture, the automatic metadata can include the exact time the picture was taken, the type of camera that was used, the GPS location and the resolution, whereas the manual metadata can include the name of the author and a description. Having this type of metadata attached to every piece of data you store is crucial, especially as you scale up. One challenge that often presents itself is where to store this metadata, since many file formats don't let you embed arbitrary metadata in the file itself.
Using a database
One common way to store metadata is in a database. One advantage of this method is that you can define strict rules that must be followed for each data type. For example, in the case of pictures, you can create a database table that includes columns such as the location of the picture and the author, and force any picture that gets added to your data lake to include this information. One drawback, however, is that a relational database tends to be less flexible. If you want to add or remove fields later on, you need to change the schema of the table. The link between the table entry and the actual piece of data may also get lost over time, since every operation has to be performed twice: once on the file and once on its database entry. Someone may delete a file but forget to delete its related database entry, or a file may get overwritten by a completely different type of data without the database entry being updated.
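To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names and the find_orphans helper are assumptions made up for illustration, not a prescribed schema. The NOT NULL constraints show how the database can force every picture to carry an author and a location, and the orphan check shows the kind of extra bookkeeping needed to notice when files and entries drift apart.

```python
import sqlite3


def create_picture_table(conn):
    """Create a table whose schema makes author and location mandatory."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS picture_metadata (
            path        TEXT PRIMARY KEY,  -- where the file lives in the lake
            author      TEXT NOT NULL,     -- manual metadata, required
            location    TEXT NOT NULL,     -- required
            description TEXT               -- optional
        )
        """
    )


def register_picture(conn, path, author, location, description=None):
    """Insert one entry; sqlite3.IntegrityError is raised if a required field is None."""
    conn.execute(
        "INSERT INTO picture_metadata (path, author, location, description) "
        "VALUES (?, ?, ?, ?)",
        (path, author, location, description),
    )


def find_orphans(conn, existing_paths):
    """Return entries whose file no longer exists (hypothetical consistency check)."""
    rows = conn.execute("SELECT path FROM picture_metadata").fetchall()
    return sorted(path for (path,) in rows if path not in existing_paths)
```

Called with an in-memory database (sqlite3.connect(":memory:")), register_picture rejects a picture with no author, while find_orphans has to be run separately, and regularly, to catch files that were deleted without their entry.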
Using a document template
Having a metadata document follow every piece of content, usually in JSON format, is much more flexible. For example, if you have a folder with a bunch of files, you ensure that next to each file is a similarly named file with a .json extension which contains the metadata. The relationship is far easier to see, and if you want to change the content of the metadata, you simply update the JSON file, since there's no schema to adhere to. However, having a large number of additional files may incur a performance hit, and searching through all these files will likely be much slower than querying a database. One possible way around that is to store your documents in a NoSQL database.
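The sidecar approach can be sketched in a few lines of Python. The naming convention used here (appending .json to the full file name, so photo.jpg gets photo.jpg.json) and the function names are assumptions for illustration; any consistent convention works.

```python
import json
from pathlib import Path


def sidecar_path(data_file):
    """Return the path of the JSON sidecar that sits next to data_file."""
    data_file = Path(data_file)
    # Assumed convention: keep the original extension and append ".json".
    return data_file.with_name(data_file.name + ".json")


def write_metadata(data_file, metadata):
    """Write (or overwrite) the sidecar; no schema is enforced."""
    sidecar_path(data_file).write_text(json.dumps(metadata, indent=2))


def read_metadata(data_file):
    """Load the metadata stored next to data_file."""
    return json.loads(sidecar_path(data_file).read_text())


def find_missing_sidecars(folder):
    """List data files in folder that have no sidecar next to them."""
    missing = []
    for entry in Path(folder).iterdir():
        is_data_file = entry.is_file() and entry.suffix != ".json"
        if is_data_file and not sidecar_path(entry).exists():
            missing.append(entry.name)
    return sorted(missing)
```

Note that find_missing_sidecars has to walk the whole folder, which hints at why searching across many sidecar files gets slow at scale.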
Using custom tags
Many modern cloud platforms address this issue with the concept of tags. For example, when you store an object in AWS S3, you can attach tags to that object. This means you only have one entry to update, the object itself, and the tags always follow it. You don't have to keep track of both the file and an additional JSON document or database entry. Tags are also flexible since they are free-form key/value pairs, although S3 limits you to 10 tags per object. The other disadvantage is that tags are an additional paid feature. AWS charges a very small amount per tag, but if you have billions of files, that amount becomes significant.
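As a rough sketch of how tagging looks in practice, the helper below turns a metadata dictionary into the structure that boto3's put_object_tagging call expects. The helper names, bucket and key are hypothetical; the sketch assumes you pass in an S3 client created with boto3.client("s3") and that valid AWS credentials are configured.

```python
MAX_TAGS_PER_OBJECT = 10  # S3 limit on tags per object


def build_tagging(metadata):
    """Convert a flat dict into the Tagging structure used by the S3 API."""
    if len(metadata) > MAX_TAGS_PER_OBJECT:
        raise ValueError(
            f"S3 allows at most {MAX_TAGS_PER_OBJECT} tags per object, "
            f"got {len(metadata)}"
        )
    # Tag keys and values are strings, so everything is converted with str().
    return {
        "TagSet": [{"Key": str(k), "Value": str(v)} for k, v in metadata.items()]
    }


def tag_object(s3_client, bucket, key, metadata):
    """Attach metadata as tags; this replaces any tags already on the object."""
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging=build_tagging(metadata),
    )
```

With a real client, a call such as tag_object(s3, "my-bucket", "photos/a.jpg", {"author": "Alice"}) would set the tags on that object. Because the call replaces the whole tag set, updating a single tag means reading the existing tags first and writing them all back.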
At the end of the day, there is no single correct way to store metadata. The method you use will depend on your needs. The important thing is to have it. Trying to manage a data lake without metadata is an exercise in futility. Keeping as much of the original metadata as possible is crucial, and finding the best way to do that should be part of your initial design, before you even start building a new data lake.